WORKSHOP AND SYMPOSIUM ON 
LONGITUDINAL ANALYSIS FOR COMPLEX 
SURVEYS 


Statistics Canada, Ottawa, Canada 


Statistics Canada’s XVth annual international 
methodology symposium will be on the topic of 
longitudinal analysis for complex surveys. In 
conjunction with this symposium, Statistics Canada and 
the Centre de recherches mathématiques (CRM), 
Université de Montréal, are sponsoring a workshop on 
this same topic. This workshop is one of the many 
events taking place during the CRM’s theme year in 
statistics. 


The focus of Symposium ‘98 is on recently developed 
methods in longitudinal data analysis. Emphasis will be 
given to the theory and application of longitudinal 
methods for data from complex surveys. The symposium 
will give participants an opportunity to meet colleagues 
who are involved in solving problems unique to the 
analysis of survey data, including David Binder, Wayne 
Fuller, Harvey Goldstein, Lisa Lavange, Jerry Lawless, 
Danny Pfeffermann, and J.N.K. Rao. 


We invite abstracts for papers related to the theme of 
Symposium ‘98. A non-exhaustive list of topics is 
included with this invitation. Papers concerning new or 
previously undocumented approaches, methodologies and 
applications are especially welcome. Academic 
researchers and practitioners from both the private and 
public sectors are encouraged to submit. 


Abstracts of 200-300 words, in English or French, along 
with the presenter’s name, affiliation, complete address, 
telephone and fax numbers and email address, should be 
sent to the address below. The deadline for abstracts 
is October 31, 1997. The final selection of papers will 
be announced by December 31, 1997. 


Submit abstracts to: 
Michael Hidiroglou 
Statistics Canada 
11th floor, R.H. Coats Building 
Ottawa, Ontario 
Canada K1A 0T6 
Telephone: (613) 951-4767 
Fax: (613) 951-1462 
Email: symposium98@statcan.ca 


Presenters must submit a draft paper, in English or 
French, by April 17, 1998, for the purposes of official 
simultaneous translation. The final version of a paper 
must be provided by June 30, 1998, in order to appear in 
the symposium proceedings. 
May 19-22, 1998 


Non-exhaustive list of topics: 

Preparing/storing survey data for longitudinal analysis. 
Imputing for longitudinal data analysis. 
Weighting issues with longitudinal surveys. 
Gross flows - Methods of estimation and applications. 
Multi-level modelling techniques and applications to longitudinal survey data (including random effects models). 
Event history techniques and applications with survey data. 
Marginal modelling and applications with survey data. 
Software for applying longitudinal techniques to survey data. 
Causal analysis of panel data. 


For more information, please visit our web site: 

www.statcan.ca/english/conferences/symposium98/index.htm 
Survey Methodology, June 1997 
Statistics Canada 


In This Issue 


This issue of Survey Methodology contains articles on a variety of topics. Stafford and Bellhouse, in 
the first paper, present the basic building blocks to develop a comprehensive computer algebra for survey 
sampling theory. They show that three basic techniques in sampling theory depend on the repeated 
application of rules that give rise to partitions. The methodology is illustrated through applications to 
moment calculation of the sample mean, the ratio estimator and the regression estimator under the special 
case of simple random sampling without replacement. The methodology was implemented in the 
programming language Mathematica. 

Hinkins, Oh and Scheuren introduce a new strategy for analysis of data from complex surveys. They 
draw a sub-sample in such a way that the sub-sample may be considered to be a simple random sample 
from the original population and then apply standard procedures for IID data. They suggest repeating 
the procedure many times to recover information lost in sub-sampling the original sample. They show 
how to implement their approach for stratified element sampling, for one and two stage cluster sampling, 
and for two PSU per stratum designs. 

Nascimento Silva and Skinner consider the problem of variable selection for regression estimation. 
They develop an approach based on minimizing the mean squared error of the resultant estimator. They 
empirically compare their approach to others using data from a 1988 test of Brazilian census procedures; 
the proposed procedures have good bias and mean squared error properties. 

Eltinge and Yansaneh study the problem of formation of nonresponse adjustment cells. Within the 
general paradigms of estimated-probability and estimated-item based cells, they consider a variety of 
diagnostics for evaluating a set of adjustment cells. The diagnostic procedures include: comparison of 
estimates and standard errors for different numbers of adjustment cells; assessment of within-cell bias; 
assessment of cell widths relative to precision of estimated response probabilities; and comparisons of 
cell-based estimates to the unadjusted estimate. 

Kovačević and Yung conduct an empirical study to compare variance estimation methods for measures 
of income inequality estimated from complex survey data. Variance estimation methods included in the 
study are: jackknife; bootstrap; grouped balanced half-sample method; repeatedly grouped balanced half- 
sample method; and a Taylor method based on estimating equations. After comparing relative bias, 
relative stability, and coverage properties of associated confidence intervals for a number of income 
inequality measures, they conclude that the Taylor method works best with the bootstrap method coming 
second. 

Humphreys and Skinner investigate the use of the instrumental variable estimation method for 
estimation of gross flows among discrete states. This approach may be useful when external estimates 
of misclassification rates are not available. They numerically illustrate their method using data from the 
U.S. Panel Study of Income Dynamics and the two states “employed” and “not employed”. They show 
that when measurement error is present, the unadjusted estimates can have considerable bias; this 
problem may be overcome by using suitable instrumental variables. 

Waksberg, Judkins and Massey discuss issues involved in oversampling geographical areas to produce 
estimates for small domains of the population in demographic surveys, in conjunction with household 
screening. An empirical evaluation of the variance reduction is presented, along with an assessment of 
the sampling robustness over time. Simultaneous geographic oversampling for estimation of several small 
domains is discussed. 

Losinger, in his paper, proposes a modified random groups standard error estimator for data from the 
U.S. Decennial Census sample. The usual random groups estimator has two undesirable properties for 
binomial variables: estimates of standard error for the “yes” and “no” responses are not equal; if all 
respondents answer “yes” the estimated standard error is not equal to zero. The essential idea of the 
proposed modification is to apply a ratio adjustment to each subgroup estimate so that subgroup estimates 
of population agree with the total. 




Finally, Zeelenberg gives a simple technique, which exploits the use of differentials, to linearize 
design-based, nonlinear estimators. Ultimately, the linearized expressions allow one to obtain simple 
Taylor-based expressions for the variances of the nonlinear estimators. He illustrates the technique using 
two examples: the regression coefficient estimator and the regression estimator. 


The Editor 
A Computer Algebra for Sample Survey Theory 


J.E. STAFFORD and D.R. BELLHOUSE 


ABSTRACT 


A system of procedures that can be used to automate complicated algebraic calculations frequently encountered in sample 
survey theory is introduced. It is shown that three basic techniques in sampling theory depend on the repeated application 
of rules that give rise to partitions: the computation of expected values under any unistage sampling design, the 
determination of unbiased or consistent estimators under these designs and the calculation of Taylor series expansions. The 
methodology is illustrated here through applications to moment calculations of the sample mean, the ratio estimator and 
the regression estimator under the special case of simple random sampling without replacement. The innovation presented 
here is that calculations can now be performed instantaneously on a computer without error and without reliance on existing 
formulae which may be long and involved. One other immediate benefit of this is that calculations can be performed where 
no formulae presently exist. The computer code developed to implement this methodology is available via anonymous ftp 
at fisher.stats.uwo.ca. 


KEY WORDS: k-statistics; Partitions; Product moments; Ratio and regression estimators; Symbolic computation; Variance estimation. 


1. INTRODUCTION 


In classical sampling theory two general problems concern 
us. These are the determination of an unbiased estimator of 
a parameter $\theta$ and the calculation of moments of $\hat{\theta}$, the 
estimator of $\theta$. 

The basic method to handle expectations and unbiased 
estimation is to operate on sample and population nested sums 
respectively through the inclusion probabilities, either single 
or joint probabilities as appropriate. A nested sum is a sum 
over the range of one or more indices such that each term in 
the sum depends on indices of different value. An unbiased 
estimator of any population nested sum is the associated 
sample nested sum with the quantity under the summation 
divided by the appropriate inclusion probability. Similarly the 
expectation of any sample nested sum is the associated 
population nested sum with the quantity under the summation 
multiplied by the appropriate inclusion probability. 

In sampling theory, as well as several other areas of 
statistics, many algebraic calculations depend on a partition 
of some kind. With particular reference to sampling, Wishart 
(1952) showed that basic moment calculations under simple 
random sampling without replacement relied heavily on 
partitions. Here we will use partitions to express the sum of 
products of means or totals as linear combinations of nested 
sums and vice versa. 

In the results presented here we consider the situation in which $\theta$ and $\hat{\theta}$ can be expressed as smooth functions of means or totals, population or sample as appropriate. There are two possibilities: the smooth function under consideration can be expressed as the sum of products of means or totals, or the smooth function cannot be so expressed. When the second possibility is operative the function $\hat{\theta}$ is first linearized through a Taylor expansion and $\theta$ is expressed as the root of an estimating equation. We use integer partitions to obtain terms in the Taylor linearization of a function or for the root of a function. The end result is that $\theta$ and $\hat{\theta}$ can be expressed, either exactly or approximately, as the sum of products of means or totals. These in turn can be expressed in terms of linear combinations of nested sums and vice versa. Estimation of $\theta$ or calculation of the moments of $\hat{\theta}$ is then a three step procedure: (a) Express an estimating equation for $\theta$ or the estimator $\hat{\theta}$ as the sum of products of means or totals, using Taylor linearization when necessary. (b) Transform the expression obtained in the first step to a linear combination of nested sums. Then operate on these nested sums to obtain unbiased estimates or expectations as appropriate. (c) Transform the resulting nested sums in the second step back into a sum of products of means or totals. 

The key to automation of sampling theory results is the use 
of partitions. In general, whether these partitions are simple 
partitions, like that of an integer, or more complicated, like a 
full partition, each results from the repeated application of a 
fundamental rule. When the rule is identified, the possibility 
of automating a calculation arises. Seemingly unrelated 
formulae can result from the same fundamental rule and one 
computer algebra tool can be constructive in implementing 
many different calculations. 

The notation used in the paper is outlined in §2. A discussion of expectation operators is given in §3. The concept of partitioning is reviewed in §4 and a rule is provided which leads to a simple recursive method for the enumeration of partitions. Integer partitions and Taylor linearization are discussed in §5. It is shown in §6 how the enumeration of partitions leads to the automatic calculation of expected values of products of sample means and k-statistics and to the derivation of unbiased estimators of products of finite population means and k-statistics. Also in this section we apply the methodology to ratio and regression estimation. 

Automation of these calculations and derivations will 
provide procedures which can be performed instantaneously 
and without error on a computer. Also, the reliance on 
formulae which may be long and involved is eliminated. A 
great deal of hand written algebra can be avoided. All 
computer code for the implementation of the methodology 
described here was written in the symbolic package 
Mathematica 2.0 which was installed on an IBM Risc 6000 
with 64 megabytes of RAM. It is available via anonymous ftp 
at fisher.stats.uwo.ca. Although we use Mathematica, implementation 
in other environments such as Maple, Macsyma or 
Reduce is no doubt possible. For example, Kendall (1993) 
describes a system, implemented in Reduce, for the 
identification of invariant expressions. For a complete review 
of computer algebra in probability and statistics prior to 1991, 
see Kendall (1993). 


2. SOME NOTATION 


Consider a finite population of size $N$. A measurement of interest $y_j$ is made on each unit $j$, $j \in U = \{1, \dots, N\}$. In addition a single auxiliary variable $x_j$, or possibly a $P \times 1$ vector of auxiliary variables $\mathbf{x}_j$, may be taken on the units. The $p$-th entry of this vector $\mathbf{x}_j$ is $x_{pj}$, where $p = 1, \dots, P$. Several kinds of finite population parameters may be defined on the measurements $y_j$, $x_j$ or $\mathbf{x}_j$ for $j = 1, \dots, N$. We denote a finite population parameter of interest by $\theta$. Often $\theta$ can be expressed as a smooth function of finite population means, central moments and k-statistics. For convenience here we will deal only with means and k-statistics. Note that finite population variances and covariances are also second order k-statistics. 

Not all $N$ population elements are observed. Suppose that a sample $s$ of size $n$ is chosen from the population $U$ by some sampling scheme. An estimator of $\theta$, given by $\hat{\theta}$, is a smooth function of sample means and sample k-statistics. 

In order to avoid much cumbersome summation notation we adapt the index notation of McCullagh (1987) to our purposes. For any $j$ the vector $\mathbf{x}_j$ contains $P$ entries so that each of these x-variables may be associated with one of the $P$ indices. Suppose $\{i_1, \dots, i_m\}$ is a subset of $m$ of these $P$ indices. In our adaptation of McCullagh’s notation, $x_{ij}$ is now what we called the vector $\mathbf{x}_j$. Products of these indexed quantities become multidimensional arrays. For example the product $x_{i_1 j} x_{i_2 j} x_{i_3 j}$ is a three-dimensional array of dimension $P \times P \times P$. 

Let $M$ denote a finite population mean. The argument of $M$ shows the structure of the summand in the mean. For example, $M(y) = \sum_{j=1}^{N} y_j / N$ and $M(yy)$ or equivalently $M(y^2) = \sum_{j=1}^{N} y_j^2 / N$. In index notation, for example, 

$$M(x_{i_1} x_{i_2} x_{i_3}) = \sum_{j=1}^{N} x_{i_1 j} x_{i_2 j} x_{i_3 j} / N \tag{1}$$

is a three-dimensional array. An element of this array is the mean of products in one of the permutations of the $P$ elements taken three at a time in $\mathbf{x}_j$, where up to three of the elements may be alike. The $(p,q,r)$-th element of this array is $\sum_{j=1}^{N} x_{pj} x_{qj} x_{rj} / N$, where $p, q, r = 1, \dots, P$. The sample mean is denoted by $m$ so that, for example, 

$$m(x_{i_1} x_{i_2} x_{i_3}) = \sum_{j \in s} x_{i_1 j} x_{i_2 j} x_{i_3 j} / n. \tag{2}$$


For the purpose of making asymptotic expansions, since the variance of a given estimator $\hat{\theta}$ will be $O(n^{-1})$, we define a standardized variable for $\hat{\theta}$: it is the original variable $\hat{\theta}$ centered about its expectation and scaled by $1/\sqrt{n}$, i.e., 

$$z(\hat{\theta}) = \{\hat{\theta} - E(\hat{\theta})\}\sqrt{n}. \tag{3}$$
When necessary we use the summation convention of McCullagh (1987), where subscripts repeated as superscripts indicate implicit sums over that index. As a particular example, on assuming that the $\mathbf{x}_j$ are independent and identically distributed vectors from some infinite superpopulation, multivariate superpopulation moments can be obtained through the moment generating function which is expressed in this convention as 

$$\mathrm{MGF}(t) = 1 + \sum_{h=1}^{\infty} \mu_{i_1 \cdots i_h} \prod_{j=1}^{h} t^{i_j} / h!, \tag{4}$$

where 

$$\mu_{i_1 \cdots i_h} = \left. \frac{\partial^h \mathrm{MGF}(t)}{\partial t^{i_1} \cdots \partial t^{i_h}} \right|_{t=0}. \tag{5}$$

By definition, the relationship between the moment generating function and the cumulant generating function is determined by the rule $\mathrm{MGF}(t) = \exp\{K(t)\}$, where 

$$K(t) = \sum_{h=1}^{\infty} \kappa_{i_1 \cdots i_h} \prod_{j=1}^{h} t^{i_j} / h! \tag{6}$$

is the cumulant generating function, where 

$$\kappa_{i_1 \cdots i_h} = \left. \frac{\partial^h K(t)}{\partial t^{i_1} \cdots \partial t^{i_h}} \right|_{t=0}.$$


The finite population k-statistics, denoted by $K(\cdot)$, are defined as the unbiased (under the i.i.d. superpopulation model) estimators of the associated model cumulants. The number of arguments in $K$ separated by commas denotes the order of the k-statistic. For example, the third order k-statistic $K(x_{i_1}, x_{i_2}, x_{i_3})$ is the model-unbiased estimate of the third order cumulant $\kappa_{i_1 i_2 i_3}$ in (6), where 

$$K(x_{i_1}, x_{i_2}, x_{i_3}) = \frac{N}{(N-1)(N-2)} \sum_{j \in U} [x_{i_1 j} - M(x_{i_1})][x_{i_2 j} - M(x_{i_2})][x_{i_3 j} - M(x_{i_3})]. \tag{7}$$


In the univariate case finite population k-statistics are described in Wishart (1952). In particular $K(y,y)$ and $K(y,y,y)$ in the current notation are $K_2$ and $K_3$ in Wishart’s (1952) notation. The sample k-statistics, denoted by $k(\cdot)$ with the appropriate arguments, are defined as the unbiased estimators under simple random sampling without replacement of the associated finite population k-statistics. As in Wishart (1952) the sample k-statistic can be obtained from the population k-statistic upon replacing $N$ by $n$ and upon taking the sum over $j \in s$ rather than all units in the finite population. For example, 

$$k(x_{i_1}, x_{i_2}, x_{i_3}) = \frac{n}{(n-1)(n-2)} \sum_{j \in s} [x_{i_1 j} - m(x_{i_1})][x_{i_2 j} - m(x_{i_2})][x_{i_3 j} - m(x_{i_3})].$$


Note that if a comma is not present in the population or sample k-statistic, then the product of elements which appear together is required. For example, $K(xy)$ is the first order finite population k-statistic of a new variable which is the product of the measurements $x_j$ and $y_j$ for $j = 1, \dots, N$; $K(x,y)$ is a second order k-statistic, in particular the finite population covariance between $x$ and $y$. 
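As a numerical illustration of these definitions, the following Python sketch computes a third order k-statistic and checks, by enumerating every simple random sample without replacement, that the sample k-statistic averages to the finite population k-statistic. It is not the authors' Mathematica code: the function names `k3` and `average_sample_k3` and the population values are hypothetical, and the formula assumed is the one with leading factor $n/[(n-1)(n-2)]$.

```python
from fractions import Fraction
from itertools import combinations


def k3(x, y, z):
    """Third order k-statistic of three equally long lists of measurements."""
    n = len(x)
    mx, my, mz = (Fraction(sum(v), n) for v in (x, y, z))
    s = sum((a - mx) * (b - my) * (c - mz) for a, b, c in zip(x, y, z))
    return Fraction(n, (n - 1) * (n - 2)) * s


def average_sample_k3(x, y, z, n):
    """Average of the sample k-statistic over all SRSWOR samples of size n."""
    samples = list(combinations(range(len(x)), n))
    total = sum(
        k3([x[j] for j in s], [y[j] for j in s], [z[j] for j in s])
        for s in samples
    )
    return total / len(samples)


# Hypothetical population of N = 6 units measured on three variables.
x = [1, 4, 2, 8, 5, 7]
y = [3, 1, 4, 1, 5, 9]
z = [2, 7, 1, 8, 2, 8]
population_value = k3(x, y, z)
design_expectation = average_sample_k3(x, y, z, 4)
```

Exact rational arithmetic is used so that the two quantities can be compared for equality rather than to within rounding error.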


3. OPERATORS 


The expectation operator $E$ can be applied directly to any sample nested sum to obtain a finite population nested sum. Likewise an unbiased estimator of any finite population nested sum is a sample nested sum. In terms of triple nested sums, for example, 

$$E\Big[\sum_{J_3 \in s} x_{i_1 j} x_{i_2 k} x_{i_3 l}\Big] = \sum_{J_3 \in U} \pi_{jkl}\, x_{i_1 j} x_{i_2 k} x_{i_3 l} \tag{8}$$

and 

$$E\Big[\sum_{J_3 \in s} x_{i_1 j} x_{i_2 k} x_{i_3 l} / \pi_{jkl}\Big] = \sum_{J_3 \in U} x_{i_1 j} x_{i_2 k} x_{i_3 l}, \tag{9}$$

where $J_3$ is the index set $\{j, k, l\}$ such that $j \ne k \ne l$ and where $\pi_{jkl}$ is a joint inclusion probability. Parallel expressions may be established for with replacement sampling schemes. Note that $m$ will be unbiased for the associated $M$ under simple random sampling without replacement. In general for any sampling design of fixed size $n$, 

$$E[m(x_{i_1} x_{i_2} x_{i_3})] = \frac{N}{n} M(x_{i_1} x_{i_2} x_{i_3} \pi)$$

and 

$$E\Big[\frac{n}{N} m(x_{i_1} x_{i_2} x_{i_3} / \pi)\Big] = M(x_{i_1} x_{i_2} x_{i_3}),$$

where $M(x_{i_1} x_{i_2} x_{i_3})$ and $m(x_{i_1} x_{i_2} x_{i_3})$ are defined in (1) and (2) respectively. 

The whole operation of finding the expectation of an estimator $\hat{\theta}$ or of finding an unbiased estimator for the parameter $\theta$ may be represented schematically as 

$$\sum\prod \;\Rightarrow\; \sum\sum \;\Rightarrow\; \sum\prod \tag{10}$$

where $\sum\prod$ denotes the sum of products and $\sum\sum$ denotes a sum of nested sums. If $\theta$ or $\hat{\theta}$ can be expressed as a $\sum\prod$ quantity, i.e., a sum of products of means, then finding an unbiased estimator of $\theta$ or moments of $\hat{\theta}$ reduces to following the schema in (10) and applying the appropriate operator, such as those given in (8) or (9), to $\sum\sum$, the middle step in the schema. If $\theta$ or $\hat{\theta}$ are smooth functions of means but cannot be expressed directly as $\sum\prod$ quantities, then an initial step is required before applying the schema in (10). For $\hat{\theta}$ the initial step is to obtain a Taylor expansion of $\hat{\theta}$. For $\theta$ the initial step is to obtain an estimating equation and then to solve it for the parameter. 

We illustrate the schema in (10) by considering the simple case of finding $E[\{m(x_i)\}^2]$ under simple random sampling without replacement. The first operation is to express $\{m(x_i)\}^2$ in terms of nested sums. In particular, 

$$\{m(x_i)\}^2 = \frac{1}{n^2} \sum_{j \in s} x_{ij}^2 + \frac{1}{n^2} \sum_{j \ne k \in s} x_{ij} x_{ik}. \tag{11}$$

This is the $\sum\prod \Rightarrow \sum\sum$ step. Now the expectation operator can be applied to $\sum\sum$. On applying inclusion probabilities $\pi_j = n/N$ and $\pi_{jk} = n(n-1)/[N(N-1)]$, the expectation operation on (11) yields 

$$E[\{m(x_i)\}^2] = \frac{1}{nN} \sum_{j=1}^{N} x_{ij}^2 + \frac{n-1}{nN(N-1)} \sum_{j \ne k = 1}^{N} x_{ij} x_{ik}. \tag{12}$$

Now the $\sum\sum \Rightarrow \sum\prod$ step is applied. On expressing the nested sum in (12) as the sum of products, in particular $\sum_{j \ne k = 1}^{N} x_{ij} x_{ik} = \sum_{j=1}^{N} x_{ij} \sum_{k=1}^{N} x_{ik} - \sum_{j=1}^{N} x_{ij}^2$, the third operation yields 

$$E[\{m(x_i)\}^2] = \frac{N(n-1)}{(N-1)n} \{M(x_i)\}^2 + \frac{N-n}{(N-1)n} M(x_i^2). \tag{13}$$

In (13), $M(x_i) = K(x_i)$ and $M(x_i^2) = [(N-1)/N]\, K(x_i, x_i) + K(x_i) K(x_i)$ so that (13) can be reexpressed as 

$$E[\{m(x_i)\}^2] = \{K(x_i)\}^2 + (N-n) K(x_i, x_i) / (nN). \tag{14}$$
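The result can be checked numerically. The sketch below is a Python illustration rather than the authors' Mathematica implementation; the function names and the population values are hypothetical. It enumerates every simple random sample without replacement to obtain $E[\{m(x)\}^2]$ exactly, and compares it with the closed form $\{K(x)\}^2 + (N-n)K(x,x)/(nN)$, where $K(x)$ is the population mean and $K(x,x)$ the population variance with divisor $N-1$.

```python
from fractions import Fraction
from itertools import combinations


def expected_squared_mean(x, n):
    """E[{m(x)}^2] under SRSWOR, by enumerating all samples of size n."""
    samples = list(combinations(x, n))
    total = sum(Fraction(sum(s), n) ** 2 for s in samples)
    return total / len(samples)


def closed_form(x, n):
    """{K(x)}^2 + (N - n) K(x, x) / (n N), using k-statistics of the population."""
    N = len(x)
    K1 = Fraction(sum(x), N)                          # first order k-statistic (mean)
    K2 = sum((xj - K1) ** 2 for xj in x) / (N - 1)    # second order k-statistic
    return K1 ** 2 + Fraction(N - n, n * N) * K2


population = [2, 3, 5, 7, 11, 13]   # hypothetical measurements, N = 6
enumerated = expected_squared_mean(population, 3)
formula = closed_form(population, 3)
```

Because `Fraction` arithmetic is exact, agreement between `enumerated` and `formula` is an equality check and not a floating point approximation.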


Likewise, following the schema in (10), the operations for finding an unbiased estimator of, for example, $\{M(x_i)\}^2$ are similar to (11), (12) and (13). The estimand $\{M(x_i)\}^2$ is expressed in nested sums similar to (11). These sums will be nested finite population sums. Similar to (12) the inclusion probabilities are applied. In this case the finite population sums are replaced by sample sums and each summand is divided by the appropriate inclusion probability. Finally, similar to (13) the resulting nested sample sums are expressed as products of sums. 

Each of the elementary operations to obtain an expected 
value through equations (11), (13) and (14), or to obtain an 
unbiased estimator, can be carried out using partitions. These 
operations are: expressing sums of products as nested sums 
and vice versa, and expressing means in terms of k-statistics 
and vice versa. 
4. PARTITIONS AND FUNDAMENTAL 
PROCEDURES 


Central to the automation of all algebraic calculations 
considered here is the notion of a partition. Partitioning as a 
focal point gives the appearance that the automated methods 
presented here are nothing more than an integer partition or a 
partition of an index set. While we assume that a partition of 
an integer is understood, a full partition requires a more 
formal definition. 

Consider a set of $m$ indices $I_m = \{i_1, \dots, i_m\}$. A single partition $P_m$ of $I_m$ divides the $m$ indices into $k \le m$ mutually exclusive and exhaustive subsets or blocks of $I_m$. We write $P_m = (b_1 | b_2 | \dots | b_k)$, where the $b_1, \dots, b_k$ are the blocks of $I_m$. $P_m$ is unique up to permutations of indices within the blocks $b_l$. The block $b_l$ is comprised of a subset of the indices of $I_m$. Elements within a block may be constrained to an alphabetical ordering and the blocks themselves may be ordered such that leading elements of each block are ordered alphabetically. This ensures the uniqueness of the partition $P_m$. In this case $P_m$ would be called a standard ordered partition. Ordering the partitions in this manner does not offer any computational advantage and hence is not a requirement in what follows. The full partition of $I_m$ is the set $\wp_m$ of all single partitions $P_m$ of $I_m$. 

Now we may identify the full partition of $I_m$ in an algorithmic way via an inclusion-exclusion rule. 

i. Let $\wp_1 = \{(i_1)\}$. 

ii. An inclusion-exclusion rule determines the contribution to $\wp_m$ by a partition $P_{m-1} \in \wp_{m-1}$. In the inclusion part of the rule, the new index $i_m$ is added as an element in turn to each of the blocks $b_1, \dots, b_k$ which comprise $P_{m-1}$. If $P_{m-1}$ has $k$ blocks, $k$ partitions for $\wp_m$ are created. In the exclusion part of the rule a new block containing the single index $i_m$ is added to $P_{m-1}$. 

For example, the full partition of $I_3 = \{i_1, i_2, i_3\}$ is given by the steps 

$$\begin{aligned} \text{i.}\quad & \wp_1 = \{(i_1)\} \\ \text{ii.}\quad & \wp_2 = \{(i_1 i_2),\ (i_1 | i_2)\} \\ \text{iii.}\quad & \wp_3 = \{(i_1 i_2 i_3),\ (i_1 i_2 | i_3),\ (i_1 i_3 | i_2),\ (i_1 | i_2 i_3),\ (i_1 | i_2 | i_3)\}. \end{aligned} \tag{15}$$

From step (i) to step (ii) the inclusion rule results in the partition $(i_1 i_2)$ and the exclusion rule results in $(i_1 | i_2)$. From step (ii) to step (iii) the inclusion rule results in the creation of the partitions $(i_1 i_2 i_3)$, $(i_1 i_3 | i_2)$, and $(i_1 | i_2 i_3)$. The exclusion rule yields the partitions $(i_1 i_2 | i_3)$ and $(i_1 | i_2 | i_3)$. This type of construction is easy to automate since it depends on a simple rule. Details of automating the partition of indices into full partitions and complementary set partitions are given in Stafford (1996). 
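The inclusion-exclusion rule translates directly into a short program. The following Python sketch is one possible rendering, not the Mathematica code referred to in the paper; the function name `full_partition` and the representation of a partition as a list of blocks are assumptions made for illustration. For each partition of the first $m-1$ indices it applies the inclusion part of the rule to every block and then the exclusion part once.

```python
def full_partition(indices):
    """All set partitions of a non-empty list `indices`, each as a list of blocks."""
    partitions = [[[indices[0]]]]                 # step i: the single partition (i1)
    for new in indices[1:]:
        grown = []
        for p in partitions:
            # Inclusion: add the new index to each existing block in turn.
            for b in range(len(p)):
                grown.append([blk + [new] if c == b else blk
                              for c, blk in enumerate(p)])
            # Exclusion: add the new index as a block of its own.
            grown.append(p + [[new]])
        partitions = grown
    return partitions


three = full_partition(["i1", "i2", "i3"])
counts = [len(full_partition(list(range(m)))) for m in range(1, 6)]
```

A partition with $k$ blocks contributes $k+1$ partitions at the next step, so the number of partitions grows as 1, 2, 5, 15, 52 for $m = 1, \dots, 5$.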

Consider, for example, the classical problem of writing the model moments of the random vector $\mathbf{x}_j$ in terms of its cumulants. As in (5) we can identify the $h$-th moment array by differentiating $\mathrm{MGF}(t)$ in (4) $h$ times and setting $t$ equal to the zero vector. The result is the $h$-th coefficient in the expansion of $\mathrm{MGF}(t)$. Equivalently we can apply the same operation to $\exp\{K(t)\}$. In this case the result is a sum that depends on the coefficients of $K(t)$ in (6). For example, we may write the first three moments in terms of cumulants as follows: 

$$\begin{aligned} \mu_{i_1} &= \kappa_{i_1}, \\ \mu_{i_1 i_2} &= \kappa_{i_1 i_2} + \kappa_{i_1}\kappa_{i_2}, \\ \mu_{i_1 i_2 i_3} &= \kappa_{i_1 i_2 i_3} + \kappa_{i_1 i_2}\kappa_{i_3} + \kappa_{i_1 i_3}\kappa_{i_2} + \kappa_{i_1}\kappa_{i_2 i_3} + \kappa_{i_1}\kappa_{i_2}\kappa_{i_3}. \end{aligned}$$

Now in each case the result is a sum over the full partitions given in (15). These partitions arise since the multiplication rule for differentiation mimics the inclusion-exclusion rule for the enumeration of the full partition. 
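The moment and cumulant relationship can be sketched in Python as a sum over the full partition, with one cumulant factor per block. This is an illustration only: `full_partition`, `moment_from_cumulants` and the callable `kappa` (which maps a block of indices to a cumulant value) are hypothetical names. Two well known univariate cases serve as checks: a Poisson variable, whose cumulants all equal its mean, and a standard normal variable, whose only non-zero cumulant is the second.

```python
from math import prod


def full_partition(indices):
    """All set partitions of `indices`, built by the inclusion-exclusion rule."""
    partitions = [[[indices[0]]]]
    for new in indices[1:]:
        grown = []
        for p in partitions:
            for b in range(len(p)):                       # inclusion
                grown.append([blk + [new] if c == b else blk
                              for c, blk in enumerate(p)])
            grown.append(p + [[new]])                     # exclusion
        partitions = grown
    return partitions


def moment_from_cumulants(indices, kappa):
    """Moment for `indices` as a sum over partitions of products of cumulants."""
    return sum(prod(kappa(tuple(block)) for block in p)
               for p in full_partition(list(indices)))


lam = 2
poisson_kappa = lambda block: lam                          # every cumulant equals lam
normal_kappa = lambda block: 1 if len(block) == 2 else 0   # standard normal

poisson_third = moment_from_cumulants((1, 1, 1), poisson_kappa)
normal_fourth = moment_from_cumulants((1, 1, 1, 1), normal_kappa)
```

For the Poisson case the third raw moment is $\lambda^3 + 3\lambda^2 + \lambda$, and for the standard normal case the fourth moment is 3, the number of ways of splitting four indices into two pairs.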

The above result is applied to sampling theory where we 
consider the problem of finding the expected value of a 
product of sample sums. The calculation requires expanding 
the product of the sums to identify terms where the finite 
population expectation operator will behave differently due 
to differences in the values of inclusion probabilities and joint 
inclusion probabilities. 

For example, the product of sums $\big(\sum_{j \in s} x_{i_1 j}\big)\big(\sum_{j \in s} x_{i_2 j}\big)\big(\sum_{j \in s} x_{i_3 j}\big)$ can be expressed as 

$$\sum_{j \in s} x_{i_1 j} x_{i_2 j} x_{i_3 j} + \sum_{j \ne k \in s} x_{i_1 j} x_{i_2 j} x_{i_3 k} + \sum_{j \ne k \in s} x_{i_1 j} x_{i_2 k} x_{i_3 j} + \sum_{j \ne k \in s} x_{i_1 j} x_{i_2 k} x_{i_3 k} + \sum_{j \ne k \ne l \in s} x_{i_1 j} x_{i_2 k} x_{i_3 l}. \tag{16}$$

The result corresponds to the full partition of the indices $I_3 = \{i_1, i_2, i_3\}$ given by $\wp_3$ in (15). The order of the partitions in $\wp_3$ is the same as the order given for the terms in (16). For each partition in $\wp_3$, the variables in the same block have the same second index in the appropriate term in (16). For example, the partition $(i_1 i_2 | i_3)$ corresponds to the term $\sum_{j \ne k \in s} x_{i_1 j} x_{i_2 j} x_{i_3 k}$ in (16). Each term in the result can be identified by a partition of $I_3$ and each partition determines the manner in which the expected value operator will behave. 
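The correspondence between the expanded product of sums and the full partition can be verified numerically. In the Python sketch below, which is an illustration under assumed names and data and not the authors' implementation, `x[i][j]` holds the value of variable $i$ on unit $j$. Each partition contributes one nested sum in which the units attached to different blocks are distinct, and the nested sums over all partitions add up to the product of the simple sums.

```python
from itertools import permutations
from math import prod


def full_partition(indices):
    """All set partitions of `indices`, built by the inclusion-exclusion rule."""
    partitions = [[[indices[0]]]]
    for new in indices[1:]:
        grown = []
        for p in partitions:
            for b in range(len(p)):                       # inclusion
                grown.append([blk + [new] if c == b else blk
                              for c, blk in enumerate(p)])
            grown.append(p + [[new]])                     # exclusion
        partitions = grown
    return partitions


def nested_sample_sum(partition, x, sample):
    """Sum over distinct sample units, one unit per block, of the product of x's."""
    return sum(
        prod(x[i][j] for block, j in zip(partition, units) for i in block)
        for units in permutations(sample, len(partition))
    )


def product_of_sums(x, sample):
    """Product over variables of the sample sum of each variable."""
    return prod(sum(row[j] for j in sample) for row in x)


# Hypothetical data: three variables on five units, sample s = {0, 2, 3}.
x = [[1, 2, 3, 4, 5], [2, 0, 1, 5, 3], [3, 1, 4, 1, 2]]
sample = [0, 2, 3]
lhs = product_of_sums(x, sample)
rhs = sum(nested_sample_sum(p, x, sample) for p in full_partition([0, 1, 2]))
```

Writing the terms this way makes it clear why the expectation operator acts differently on each one: a term with $k$ blocks involves $k$ distinct units and therefore a $k$-th order inclusion probability.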

In general, we want to expand products of the form $\prod_{i_l \in I_t} \sum_{j \in s} x_{i_l j}$, where the product is taken over the elements $i_l$ of the index set $I_t = \{i_1, \dots, i_t\}$. As in (16), the product can be expressed in terms of the full partition of $I_t$. This is because the iterative rule for expanding a product of sums mimics the inclusion-exclusion rule. 

The expansion of the products of sums through partitions is demonstrated inductively as follows. Assume the product of the first $t-1$ sums can be expressed as a sum over the full partition of the index set $I_{t-1} = \{i_1, \dots, i_{t-1}\}$, in particular 

$$\prod_{l=1}^{t-1} \sum_{j \in s} x_{i_l j} = \sum_{P_{t-1} \in \wp_{t-1}} X_{P_{t-1}}. \tag{17}$$

In (17) the term $X_{P_{t-1}}$ is the sum identified by the partition $P_{t-1} = (b_1 | \dots | b_k)$, $k = 1, \dots, t-1$. The blocks $b_l$ indicate groups of variables with the same second index and so $P_{t-1}$ induces an index set $J_k = \{j_1, \dots, j_k\}$ of second indices. We can express $X_{P_{t-1}}$ as 

$$X_{P_{t-1}} = \sum_{j_1 \ne \dots \ne j_k \in s} \Big( \prod_{l=1}^{k} X_l \Big) \tag{18}$$

where $X_l$ is a product of x's defined by the block $b_l$ that all have the same second index $j_l$. To illustrate (18), consider, for example, the third term of (16). Here $P_{t-1} = (i_1 i_3 | i_2)$ and $J_2 = \{j, k\}$ so that in (18) the sum is taken over $j \ne k \in s$ and the components of the product are $X_1 = x_{i_1 j} x_{i_3 j}$ and $X_2 = x_{i_2 k}$. Returning to the general discussion, when both sides of (17) are multiplied by $\sum_{j \in s} x_{i_t j}$, the product of the first $t$ sums is obtained. Now the product $X_{P_{t-1}} \sum_{j \in s} x_{i_t j}$ can be expressed as 

$$\sum_{l=1}^{k} \ \sum_{j_1 \ne \dots \ne j_k \in s} \Big( \prod_{l'=1}^{k} X_{l'} \Big) x_{i_t j_l} \; + \sum_{j_1 \ne \dots \ne j_{k+1} \in s} \Big( \prod_{l=1}^{k} X_l \Big) x_{i_t j_{k+1}}. \tag{19}$$

The first term in (19) corresponds to the inclusion part of the rule and the second term in (19) corresponds to the exclusion part of the rule. When (19) is summed over all $P_{t-1} \in \wp_{t-1}$, the result will be a sum over the full partition of the first $t$ indices given by $I_t$, i.e., the sum over all $P_t \in \wp_t$. 

Once the product of sums, $\prod_{l=1}^{t} \sum_{j \in s} x_{i_l j}$, is expanded into 
a sum of nested sums, the finite population expected value 
operator can be applied to each term so that the expected 
value of this product can be obtained. The expected value 
under simple random sampling without replacement of the 
product of sums results in a weighted sum of nested sums, 
with each sum taken over the finite population. We then wish 
to evaluate these nested sums. 

In general we wish to evaluate the nested sum $\sum_{J_t} X_{J_t}$ where $J_t$ is the index set $\{j_1, \dots, j_t\}$. The sum is taken over all $j_1 \ne \dots \ne j_t$ with each $j_l = 1, \dots, N$. The summand $X_{J_t}$ is the product $x_{i_1 j_1} x_{i_2 j_2} \cdots x_{i_t j_t}$. In the special case when $t = 3$ or $J_3 = \{j, k, l\}$ the nested sum can be written in terms of full sums as 

$$\begin{aligned} \sum_{J_3} X_{J_3} = \sum_{j \ne k \ne l = 1}^{N} x_{i_1 j} x_{i_2 k} x_{i_3 l} = {} & 2 \sum_{j=1}^{N} x_{i_1 j} x_{i_2 j} x_{i_3 j} - \sum_{j=1}^{N} x_{i_1 j} x_{i_2 j} \sum_{j=1}^{N} x_{i_3 j} - \sum_{j=1}^{N} x_{i_1 j} x_{i_3 j} \sum_{j=1}^{N} x_{i_2 j} \\ & - \sum_{j=1}^{N} x_{i_1 j} \sum_{j=1}^{N} x_{i_2 j} x_{i_3 j} + \sum_{j=1}^{N} x_{i_1 j} \sum_{j=1}^{N} x_{i_2 j} \sum_{j=1}^{N} x_{i_3 j}. \end{aligned} \tag{20}$$

Note that the full sums in the rightmost expression in (20) result from the full partition $\wp_3$ in (15). The order of the partitions in $\wp_3$ is the same as the order of the terms on the right of (20). The subscripts on the right of (20) denote the block membership in $\wp_3$. For example, the partition $(i_1 i_2 | i_3)$ corresponds to the term $\sum_{j=1}^{N} x_{i_1 j} x_{i_2 j} \sum_{j=1}^{N} x_{i_3 j}$ in (20). Note also from (20) that the determination of a nested sum is complicated by the additional determination of the appropriate coefficients of the full sums. 

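In general the evaluation of finite population nested sums results from the repeated application of the rule 

$$\sum_{J_t} X_{J_t} = \Big( \sum_{J_{t-1}} X_{J_{t-1}} \Big) \Big( \sum_{j=1}^{N} x_{i_t j} \Big) - \sum_{l=1}^{t-1} \sum_{J_{t-1}} X_{J_{t-1}}\, x_{i_t j_l}. \tag{21}$$

This expression mimics the inclusion-exclusion rule where the first set of sums on the right follows the exclusion part of the rule and the second set follows the inclusion part of the rule. Repeated application of (21) yields 

$$\sum_{J_t} X_{J_t} = \sum_{P_t \in \wp_t} (-1)^{|J_t| - |P_t|} \Big\{ \prod_{l=1}^{|P_t|} (|b_l| - 1)! \Big( \sum_{j=1}^{N} \prod_{i \in b_l} x_{ij} \Big) \Big\},$$

where $|J_t|$, $|P_t|$ and $|b_l|$ are the number of indices in $J_t$, the number of blocks in the single partition $P_t$ and the number of elements in the block $b_l$ respectively. 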
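A numerical check of this evaluation can be sketched in Python. The sketch assumes that the coefficient attached to a partition $P_t$ is $(-1)^{|J_t| - |P_t|} \prod_l (|b_l| - 1)!$, which for $t = 3$ gives the coefficients 2, $-1$, $-1$, $-1$ and 1; the function names and data are hypothetical. The nested sum over distinct units is computed once by brute force and once from full sums over the partitions.

```python
from itertools import permutations
from math import factorial, prod


def full_partition(indices):
    """All set partitions of `indices`, built by the inclusion-exclusion rule."""
    partitions = [[[indices[0]]]]
    for new in indices[1:]:
        grown = []
        for p in partitions:
            for b in range(len(p)):                       # inclusion
                grown.append([blk + [new] if c == b else blk
                              for c, blk in enumerate(p)])
            grown.append(p + [[new]])                     # exclusion
        partitions = grown
    return partitions


def nested_sum_brute(x):
    """Sum of x[0][j1] * ... * x[t-1][jt] over all distinct units j1, ..., jt."""
    t, N = len(x), len(x[0])
    return sum(prod(x[i][js[i]] for i in range(t))
               for js in permutations(range(N), t))


def nested_sum_from_full_sums(x):
    """The same nested sum written as a signed combination of products of full sums."""
    t, N = len(x), len(x[0])
    total = 0
    for p in full_partition(list(range(t))):
        coef = (-1) ** (t - len(p)) * prod(factorial(len(b) - 1) for b in p)
        full_sums = prod(sum(prod(x[i][j] for i in b) for j in range(N)) for b in p)
        total += coef * full_sums
    return total


# Hypothetical population: four variables measured on five units.
x = [[1, 2, 3, 4, 5], [2, 0, 1, 5, 3], [3, 1, 4, 1, 2], [1, 1, 2, 0, 3]]
```

The brute force version visits $N!/(N-t)!$ index tuples, whereas the partition version needs only full sums over the population, which is the practical reason for preferring the latter.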


5. INTEGER PARTITIONS AND TAYLOR 
LINEARIZATION 


Suppose that under some sampling design an estimator $\hat{\theta}$ of a parameter $\theta$ is of interest. The methodology described in §§2 to 4 may be used in moment calculations for $\hat{\theta}$ or to find unbiased estimators of these moments. Only in the simplest cases can this methodology be applied directly. Typically $\hat{\theta}$ must be linearized so that it becomes a polynomial function of sample means or sums which are $O_p(1)$ random variables with respect to the sampling design. Once $\hat{\theta}$ is linearized in this way the methodology of §§2 to 4 is applicable. 

The objective of the linearization is to write $\hat{\theta}$ as an asymptotic expansion where terms descend in order by $1/\sqrt{n}$; specifically 

$$\hat{\theta} = \hat{\theta}_0 + \hat{\theta}_1/\sqrt{n} + \hat{\theta}_2/n + \dots, \tag{22}$$

where $\hat{\theta}_i$ is the coefficient of the $n^{-i/2}$ term. Typically $\hat{\theta}$ is a product of quantities that can also be expanded in this way. For example, if the measurement of interest is $y$ and one auxiliary variable $x$ is present then $\theta$ might be $M(y)$ and the auxiliary information available is $M(x)$ as well as $x_j$ for $j \in s$. Then $\hat{\theta} = M(x) m(y) / m(x)$, the simple ratio estimator, is a product of three quantities $M(x)$, $m(y)$ and $1/m(x)$ all having asymptotic expansions of their own. The expansion of $M(x)$ is itself. From (3) the expansion for $m(y)$ yields $M(y) + z(m(y))/\sqrt{n}$. The expansion for $1/m(x)$ results from (3) and then applying a Taylor expansion to $[M(x) + z(m(x))/\sqrt{n}]^{-1}$. 

In general any expansion of a function with sufficient
regularity can be found if operators are defined to expand a
function, say $g(\varepsilon)$ where $\varepsilon$ is itself an expansion. We are
interested in expanding functions of the form


$g(\varepsilon) = \prod_{j=1}^{p} \varepsilon_j,$ (23)


where $\varepsilon_j$ itself has the expansion $\sum_{i=0}^{\infty} e_{ij} n^{-i/2}$. In linearizing $\hat\theta$
the basic requirement is to define an operator that returns $\theta_i$
in (22). The efficiency of this operator derives solely from a
rule for expanding functions of the form given in (23). The
calculations required are functions of integer partitions. For
example the 1/n term in the expansion of $\prod_{j=1}^{3} \varepsilon_j$ is


$e_{21}e_{02}e_{03} + e_{01}e_{22}e_{03} + e_{01}e_{02}e_{23} + e_{11}e_{12}e_{03} + e_{11}e_{02}e_{13} + e_{01}e_{12}e_{13}.$ (24)


Collecting first indices for each term in the sum results in a list
in which each element sums to 2: {(2,0,0), (0,2,0), (0,0,2),
(1,1,0), (1,0,1), (0,1,1)}. On noting that the order $n^{-i/2}$ term
in any expansion $\varepsilon_j$ is actually the (i + 1)-th term in the sum
$\sum_{i=0}^{\infty} e_{ij} n^{-i/2}$, we may modify the list derived from (24) so that
entries identify the position of terms in a sum. The
modification is to add 1 to each index value in the list. In the
list derived from (24) this results in all partitions of the integer
5 into 3 blocks: {(3,1,1), (1,3,1), (1,1,3), (2,2,1), (2,1,2),
(1,2,2)}. In general, the i-th term in the expansion of $\prod_{j=1}^{p} \varepsilon_j$
or $\varepsilon^p$, where p is a positive integer, is a sum over all
partitions of the integer i + p into p blocks. Consequently,
using this methodology any term in the expansion of, for
example, the ratio estimator can be found.
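
The partition rule can be sketched in a few lines of Python. This is only an illustration, not the Mathematica operator described later in the paper: the function names and the list-of-lists layout `coeffs[j][r]` (holding the coefficient of order $n^{-r/2}$ in the (j + 1)-th factor) are assumptions made for the example. The ordered partitions of i + p into p positive blocks are enumerated, 1 is subtracted from each entry, and the corresponding products are summed.

```python
def compositions(total, parts):
    """Yield ordered tuples of `parts` positive integers that sum to `total`."""
    if parts == 1:
        if total >= 1:
            yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def expansion_term_indices(i, p):
    """Index tuples for the n^(-i/2) term of a product of p expansions.

    Ordered partitions of i + p into p blocks, with 1 subtracted from
    every entry so that each tuple sums to i.
    """
    return sorted(tuple(c - 1 for c in comp) for comp in compositions(i + p, p))


def expansion_coefficient(i, coeffs):
    """Coefficient of n^(-i/2) in the product of the expansions in `coeffs`.

    coeffs[j][r] is the order-r coefficient of factor j (hypothetical layout);
    coefficients beyond the end of a list are treated as zero.
    """
    p = len(coeffs)
    total = 0
    for idx in expansion_term_indices(i, p):
        term = 1
        for j, r in enumerate(idx):
            term *= coeffs[j][r] if r < len(coeffs[j]) else 0
        total += term
    return total
```

For i = 2 and p = 3, `expansion_term_indices(2, 3)` returns the six index triples listed above, one for each product in (24).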

We illustrate this technique with ratio and regression 
estimation. The ratio estimator is given by 


M(x) m(y)/m(x) (25) 


and the regression estimator by 


$k(y) + b[K(x) - k(x)] = k(y) + \frac{k(x,y)}{k(x,x)}[K(x) - k(x)]$ (26)


in the notation of k-statistics. 
On using (3) the ratio estimator (25) may be expressed as 


$M(x)\left[M(y) + \frac{z(m(y))}{\sqrt{n}}\right]\left[M(x) + \frac{z(m(x))}{\sqrt{n}}\right]^{-1}.$ (27)


The expression in (27) may be expressed in terms of (23) with
p = 3. The first term in (27) is the expansion $\sum_{i=0}^{\infty} e_{i1} n^{-i/2}$ with
$e_{01} = M(x)$ and $e_{11} = e_{21} = \cdots = 0$. The first term in square
brackets in (27) is the expansion $\sum_{i=0}^{\infty} e_{i2} n^{-i/2}$ where
$e_{02} = M(y)$, $e_{12} = z(m(y))$ and $e_{22} = e_{32} = \cdots = 0$. The second
term in square brackets is the expansion $\sum_{i=0}^{\infty} e_{i3} n^{-i/2}$ where
$e_{i3} = (-1)^i \{z(m(x))\}^i / \{M(x)\}^{i+1}$. To get the $1/\sqrt{n}$ term in the
expansion of (27), in which case i = 1 and p = 3, we need to
find the integer partitions of 4 into 3 blocks. This yields the
partitions (2,1,1), (1,2,1) and (1,1,2). On subtracting 1 from
each index value in the list we obtain the list (1,0,0), (0,1,0),
(0,0,1). Therefore the required term in the expansion is
$(e_{11}e_{02}e_{03} + e_{01}e_{12}e_{03} + e_{01}e_{02}e_{13})/\sqrt{n}$ or equivalently
$[z(m(y)) - M(y) z(m(x))/M(x)]/\sqrt{n}$. The 1/n term is obtained
from (24) which reduces to


$[M(y)\{z(m(x))\}^2/\{M(x)\}^2 - z(m(y)) z(m(x))/M(x)]/n.$


The regression estimator in (26) may be expressed as 


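$K(y) + \frac{z(k(y))}{\sqrt{n}} - \left[K(x,y) + \frac{z(k(x,y))}{\sqrt{n}}\right]\left[K(x,x) + \frac{z(k(x,x))}{\sqrt{n}}\right]^{-1}\left[\frac{z(k(x))}{\sqrt{n}}\right]$ (28)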

using (3). The terms in the square brackets in (28) can be
expanded in a similar fashion to the ratio estimator. In this
case the terms in the expansions become: $e_{01} = K(x,y)$,
$e_{11} = z(k(x,y))$ and $e_{21} = e_{31} = \cdots = 0$;
$e_{i2} = (-1)^i \{z(k(x,x))\}^i / \{K(x,x)\}^{i+1}$ for i = 0, 1, 2, ...; and
$e_{03} = 0$, $e_{13} = z(k(x))$ and $e_{23} = e_{33} = \cdots = 0$.
Consequently, the $1/\sqrt{n}$ term in the
expansion of the terms in the square brackets in (28) is


$-\frac{K(x,y)\, z(k(x))}{K(x,x)\sqrt{n}}$


and the 1/n term is


$-\frac{1}{n}\left\{\frac{z(k(x,y))}{K(x,x)} - \frac{K(x,y)\, z(k(x,x))}{\{K(x,x)\}^2}\right\} z(k(x)).$


These were obtained by the same argument that was used in
the ratio estimator.


6. MACHINE APPLICATIONS TO THE 
CALCULATION OF EXPECTED VALUES OF 
SAMPLE STATISTICS AND THE DERIVATION OF 
UNBIASED ESTIMATORS 


Since the machine application to the methodology 
described in §§3 to 5 was done in the programming language 
Mathematica we give a brief description of the operation of 
Mathematica. Then we describe the operators that were 
developed in Mathematica to provide a computer algebra for 
survey sampling theory. 

Programming in Mathematica is carried out using
expressions of the form $h[e_1, e_2, \ldots]$ where the object h is
called the head of the expression and the e's are the elements
of the expression. We have developed a number of machine
expressions in Mathematica in the form of $h[e_1, e_2, \ldots]$ for
operators which we apply to developing a computer algebra
for sampling. All of these operators have been devised to
handle vectors as their arguments as well as scalars. There
are four basic operators: EV[·] for expected value, Cum[·]
for calculation of cumulants, UE[·] for unbiased estimator,
and Aexp[·] for asymptotic expansion. There is also an
operator to switch from notation using k-statistics to notation
using means and vice versa.

The expected value operator EV[·] on sample statistics
combines and carries out in Mathematica the three basic
operations shown in the schema in (10). EV[·] contains two
arguments: the first is the expression for which the expected
value is to be obtained and the second is the sampling design
which defines the inclusion probabilities. The application in
Mathematica of EV[·] to $m(x_{i_1})\,m(x_{i_2})\,m(x_{i_3})$ under simple
random sampling without replacement yields


$K(x_{i_1})K(x_{i_2})K(x_{i_3}) + \frac{(N-n)\{K(x_{i_1},x_{i_2})K(x_{i_3}) + K(x_{i_1},x_{i_3})K(x_{i_2}) + K(x_{i_1})K(x_{i_2},x_{i_3})\}}{Nn} + \frac{(N^2 - 3Nn + 2n^2)\,K(x_{i_1},x_{i_2},x_{i_3})}{N^2n^2}$


in the simplest expression of the output. Note that the result
is a function of the full partition of $\{i_1, i_2, i_3\}$. If the operand
is changed to $\{m(x_{i_1}) - M(x_{i_1})\} \times \{m(x_{i_2}) - M(x_{i_2})\} \times \{m(x_{i_3}) - M(x_{i_3})\}$,
application of EV[·] yields


$\frac{(N^2 - 3Nn + 2n^2)\,K(x_{i_1},x_{i_2},x_{i_3})}{N^2n^2}$


which was obtained by Nath (1968) for particular values of the
indices $i_1$, $i_2$ and $i_3$. In fact, the results in Nath (1968, 1969)
for the products of three and four means and the exact results
in Raghunandanan and Srinivasan (1973) for up to a product
of eight means can all be reproduced automatically with the
software that has been developed.
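
The centred result can be checked numerically in the univariate case, where all three indices refer to the same variable. The Python sketch below is not the EV[·] operator itself; it is a brute-force enumeration of every simple random sample without replacement from a small made-up population, with function names chosen only for this illustration. The population k-statistic K(x, x, x) is taken to be $N\sum(x_j - M(x))^3/\{(N-1)(N-2)\}$.

```python
from fractions import Fraction
from itertools import combinations


def k3(values):
    """Third-order population k-statistic K(x, x, x)."""
    N = len(values)
    mean = Fraction(sum(values), N)
    s3 = sum((Fraction(v) - mean) ** 3 for v in values)
    return Fraction(N, (N - 1) * (N - 2)) * s3


def third_central_moment_of_mean(values, n):
    """E[{m(x) - M(x)}^3] by enumerating all samples of size n."""
    N = len(values)
    mean = Fraction(sum(values), N)
    samples = list(combinations(values, n))
    return sum((Fraction(sum(s), n) - mean) ** 3 for s in samples) / len(samples)


def closed_form(values, n):
    """(N^2 - 3Nn + 2n^2) K(x, x, x) / (N^2 n^2)."""
    N = len(values)
    return Fraction(N * N - 3 * N * n + 2 * n * n, N * N * n * n) * k3(values)
```

Exact rational arithmetic is used so that the enumeration and the closed form can be compared for equality rather than up to rounding.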

To this point the sampling design used in each of the 
examples has been simple random sampling without 
replacement. Results under general sampling designs can be 
obtained. We illustrate these results for the operator Cum[·]
which is used to obtain the cumulants of an estimator. Note
that the second cumulant for an estimator is also the variance.
The operator Cum[·] has three arguments. The first is an
expression for the estimator, the second is the order of the 
cumulant and the third is the sampling design. Under general 
sampling designs, estimators can be expressed in terms of 
YJ] in the schema given by (10) and the )JJ can be 
expanded to obtain )’)’, the middle term in (10). There is, 
however, no general simplification to obtain the final term in 
(10). This is illustrated with the Horvitz-Thompson estimator
of M(y) given by $(n/N)\,m(y/\pi)$ in the notation developed
here. Application of the operator Cum[·] under a general
sampling design to obtain the third cumulant of the Horvitz-
Thompson estimator yields


(m0 ou 9) 


where, for example, the term 7,, is the single inclusion 
probability 7,,. 

The operator Aexp[·] has two arguments, the function for
which the expansion is required and the order of the
expansion. This operator is used in combination with the
EV[·] or Cum[·] operators to obtain approximate
expectations or cumulants. This is illustrated in the case of
the multiple linear regression estimator under simple random
sampling without replacement. When there are q covariates
the resulting regression estimator is given by


$k(y) + b_r[K(x^r) - k(x^r)]$ (29)


using index and k-statistics notation. In (29) the coefficient
$b_r$ is the vector resulting from the product $k(x^s, y)\,ik(x_s, x_r)$
in index notation, where the q × q array $ik(x_s, x_r)$ is the
inverse of the q × q array given by $k(x^s, x^r)$. Similarly we
will use $IK(x_s, x_r)$ to denote the inverse of the finite
population array $K(x^s, x^r)$. Derivation of the mean square
error of (29) requires Taylor expansions of the elements of $b_r$
followed by the appropriate moment calculations and
collection of terms. The Mathematica command to obtain the
approximate variance of (29) is obtained by first applying
Aexp[·] to (29) with 2 as the order in the expansion. Then the
operator Cum[·] is applied to the result with the following
arguments: the result from the asymptotic expansion as the
estimator, simple random sampling as the design and 2 for the
order of the cumulant. This yields


$\frac{(N-n)K(y,y)}{Nn} - \frac{(N-n)K(x^r,y)K(x^s,y)\,IK(x_r,x_s)}{Nn}$


in index notation as output. 

Estimation is achieved through the operator UE[·] which
has two arguments, the estimand and the sampling design.
For example, application of UE[·] to $\{M(x)\}^2$ under simple
random sampling yields


$\frac{Nn\{k(x)\}^2 - (N-n)\,k(x,x)}{Nn}.$


If the estimand cannot be expressed as a sum of nested sums,
but instead can be expressed as the root of an estimating
function, then UE[·] obtains a consistent estimator.
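
Unbiasedness of an estimator of $\{M(x)\}^2$ under simple random sampling without replacement can be confirmed by enumeration. The Python sketch below is an independent check, not the UE[·] operator; it assumes the estimator has the form $[Nn\{k(x)\}^2 - (N-n)k(x,x)]/(Nn)$, with k(x) the sample mean and k(x, x) the sample variance with divisor n - 1, and the function names are made up for the example.

```python
from fractions import Fraction
from itertools import combinations


def ue_mean_squared(sample, N):
    """[Nn k(x)^2 - (N - n) k(x, x)] / (Nn) for one sample from a population of size N."""
    n = len(sample)
    k1 = Fraction(sum(sample), n)                                   # k(x): sample mean
    k2 = sum((Fraction(v) - k1) ** 2 for v in sample) / (n - 1)     # k(x, x): sample variance
    return (N * n * k1 ** 2 - (N - n) * k2) / (N * n)


def expected_value(population, n):
    """Average of the estimator over all equally likely samples of size n."""
    N = len(population)
    samples = list(combinations(population, n))
    return sum(ue_mean_squared(s, N) for s in samples) / len(samples)
```

Averaging over all samples of a given size should return the squared population mean exactly.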


7. DISCUSSION OF FUTURE WORK 


The basic building blocks to develop a comprehensive 
computer algebra for survey sampling theory have been given. 
The foundation of this algebra is based on the enumeration of 
partitions. Fundamental operations under partition enumer- 
ation include the evaluation of nested sums and Taylor series 
expansions. Once these operations have been completed then 
expectations of sample statistics can be calculated or unbiased 
estimators of population quantities can be determined. 

The next phase in this work is to extend the unistage 
results to multistage and multiphase sampling. In both multi- 
stage and multiphase sampling the problem reduces to the 
computer evaluation of multiple sums under an expectation 
operator or the determination of an unbiased estimator of 
multiple finite population sums. The problem of multistage 
sampling is currently under investigation. Another current 
area of inquiry is to extend the algebra to superpopulation 
models. 

Once the basic algebra is in place then research problems 
involving algebraically complex sampling formulae can be 
easily investigated. 
Inverse Sampling Design Algorithms


SUSAN HINKINS, H. LOCK OH and FRITZ SCHEUREN' 


ABSTRACT 


In the main body of statistics, sampling is often disposed of by assuming a sampling process that selects random variables 
such that they are independent and identically distributed (IID). Important techniques, like regression and contingency table 
analysis, were developed largely in the IID world; hence, adjustments are needed to use them in complex survey settings. 
Rather than adjust the analysis, however, what is new in the present formulation is to draw a second sample from the original 
sample. In this second sample, the first set of selections are inverted, so as to yield at the end a simple random sample. Of 
course, to employ this two-step process to draw a single simple random sample from the usually much larger complex survey 
would be inefficient, so multiple simple random samples are drawn and a way to base inferences on them developed. Not 
all original samples can be inverted; but many practical special cases are discussed which cover a wide range of practices. 


KEY WORDS: Finite population sampling; Inference in complex surveys; Resampling. 


1. INTRODUCTION 


The development of modern survey sampling is an 
extraordinary achievement (Bellhouse 1988; Hansen 1987; 
Kish 1995). The very richness in that development may have 
had the effect, though, of isolating survey sampling from the 
rest of statistics — where it is the richness of models that is 
given emphasis. In fact, it is a well-known commonplace that, 
in the main body of statistics, sampling is often disposed of 
by assuming a sampling process that selects random variables 
such that they are independent and identically distributed 
(IID). 

Important techniques, like regression and contingency 
table analysis, were developed largely in this IID world; 
hence, adjustments are needed to use them in complex survey 
settings. Indeed, whole books have been written on this 
problem (Skinner, Holt and Smith 1989); and much time and 
effort have been devoted to it in software (like SUDAAN or 
WESVAR PC) specially written for surveys (See also Wolter 
1985). With all that has been done already, can something 
more of value be added? We think we may have a 
contribution to offer on how to deal better with the “seam” 
which currently exists between IID and survey statistics. 

Organizationally, the paper is divided into four sections.
This introduction is Section 1. In Sections 2 and 3 a general
problem statement is provided and several “resolutions” are 
offered in a few of the better known designs. Our approach is 
to resample the complex sample to obtain an easier to analyze 
data structure. Specifically, we cover stratified element 
sampling, one and two-stage cluster samples, plus the 
important two PSU per stratum design (Section 2). Because 
any given resample is unlikely to contain all the information 
in the original survey, we look at what happens when the 
original complex sample is repeatedly resampled. A concrete 
illustration of our ideas is also given in Section 3; this has 


been taken from our practice and is based on a highly 
stratified Statistics of Income (SOI) sample of corporate tax 
returns (e.g., Hughes, Mulrow, Hinkins, Collins and Uberall 
1994). In a concluding section (Section 4), we discuss a few 
applications and some next steps needed for our still embry- 
onic ideas to grow more useful. 


2. PROBLEM STATEMENT AND POSSIBLE 
“RESOLUTIONS” 


2.1 Motivation and Basic Approach 


Suppose we wanted to apply an IID procedure to a 
complex survey sample. Suppose, too, that we wanted to take 
a fresh look at “solving” the seam problem that occurs 
because the survey design is not IID. How might one 
proceed? Well, there is a familiar expression that may fit our 
approach 


If you only have a hammer, every 
problem turns into a nail. 


Now, as samplers, we have a hammer and it is sampling 
itself. Can we turn the seam problem in surveys into a nail 
that can be dealt with by using another sampling design? 

It is our contention that some of the time the answer to this 
question is “Yes.” We call this second sample design an 
“Inverse Sampling Design Algorithm” — hence, the name of 
this paper. 

A schematic might help visualize the algorithm (see Figure 1).
In the diagram two sampling approaches are compared — both 
yielding simple random samples from a population: 

(1) The first design (top row) does this by employing a
conventional direct simple random (SRS) selection
process (e.g., Cochran 1977), such that all possible
samples of a given size have the same probability of
selection. (Such designs are often impracticable or
inefficient or both; hence, they are almost never used by
survey samplers, despite their ubiquity in textbooks.)

(2) The second design envisions a two-step process. The
first step is to sample the population in a complex way
that focuses carefully on the nature of the population
and the client's needs, using the client's resources
frugally (this is the survey sampler's province, par
excellence).

(3) What is new in our formulation is to draw a second
(perhaps complex?) sample that inverts the first set of
selections, so as to yield at the end a simple random
sample. Of course, to employ this two-step process to
draw a single simple random sample from the usually
much larger complex survey would be inefficient, so
we propose to create multiple simple random samples
and base our inferences on them.


[Figure 1. Schematic comparing the two approaches. Top row: Population, direct Selection. Bottom row: Population, Complex Sample Design, Complex Survey Sample, Inverse Sample Selection.]


While elaborations are possible, the basic nature of the 
algorithms we are talking about should, by this point, be 
obvious. They can consist of just four basic steps: 

(1) Invert, if you can, the existing complex design, so that 
simple random subsamples can be generated (to some 
useful degree of approximation). 

(2) Potentially, apply your conventional statistical package 
directly to the subsample, since that is now appropriate. 

(3) Repeat the subsampling and conventional analysis, in 
steps (1) and (2), over and over again. 

(4) Retain, if you can, the flavour of the original 
randomization paradigm by using the distribution of 
subsample results as a basis of inference (rather than 
the original complex sample). 


Notice some things that this approach is — and is not: First, it 
is extremely computer intensive — presupposing cheap, even 
very cheap computing. Second, it presupposes that practical 
inverse algorithms exist (which may not always be the case). 
Third, it also assumes that the original power of the full 
sample can be captured if enough subsamples are taken, so 
that no appreciable efficiency is lost. Fourth, as much as it 


may resemble the bootstrap (Efron 1979), we are not doing
bootstrapping. There is no intent to mimic the original
selections, as would be required to use the bootstrap properly
(e.g., McCarthy and Snowden 1985; Rao and Wu 1988);
just the opposite: our goal here is to create a totally different
and more analytically tractable set of subsamples from the
original design.


2.2 Defining An Inverse Sampling Algorithm 


Suppose that we wish to draw a simple random sample,
without replacement, from a finite population of size N.
Suppose further that the population is no longer available for
sampling, but we have a sample selected from this population
using a sample design D; let $S_D$ denote this sample. Let $S_m$
denote a second sample of size m that could be drawn from
the population. An inverse sampling algorithm must describe
how to select a sample from $S_D$ so that for any given sample $S_m$


$\Pr(\text{select } S_m \mid S_D) \times \Pr(S_m \subset S_D) = 1 \Big/ \binom{N}{m}.$ (1)


The first step is to calculate the probability that an arbitrary
but fixed sample $S_m$ is contained in the sample $S_D$.
Obviously, there are constraints on the size of the simple
random sample (SRS) that can be drawn in this manner; the
probability that $S_D$ contains $S_m$ cannot be zero. Certainly,
therefore, the SRS cannot be larger than the size of the
original sample $S_D$, and in fact the size of the SRS is
generally required to be much smaller than the original
complex sample.

The problem, then, is to find a general algorithm to select
an SRS from a given sample $S_D$, with the correct conditional
probability. It is also necessary to check that valid probability
functions are used. The following subsections show the 
inverse sampling algorithms for a few of the more common 
sample designs: stratified, cluster, multistage, and stratified 
multistage designs. We also give an example where an inverse 
algorithm at first does not appear feasible. 


2.3 Inverting A Stratified Sample 


In this subsection the inverse algorithm is given for a
stratified sample with four strata. The algorithm generalizes
for any number of strata. We have a stratified sample with
fixed sample sizes $n_h$ in each stratum h, and known stratum
population sizes, $N_1 + N_2 + N_3 + N_4 = N$. Because a given
sample of arbitrary size m from the population might be
contained entirely within one stratum, the largest simple
random sample that can be selected from a stratified sample
is of size $m = \min_h\{n_h\}$.

For a given sample $S_m$, let $(x_1, x_2, x_3, x_4)$ denote the
number of units in each stratum. Each $x_h$ will be between 0
and m, and $x_1 + x_2 + x_3 + x_4 = m$. The probability that $S_m$ is
contained in the stratified sample is equal to the number of
stratified samples containing these m units divided by the total
number of possible stratified samples, i.e.


$\Pr(S_m \subset S_D) = \prod_{h=1}^{4} \binom{N_h - x_h}{n_h - x_h} \Big/ \binom{N_h}{n_h}.$ (2)


The algorithm for selecting a SRS from the stratified
sample consists of the following three steps:
(1) Determine the size of the SRS to be selected:
$m \le \min_h\{n_h\}$.
(2) Generate a realization $\{m_1, \ldots, m_4\}$ from a hyper-
geometric distribution, with probabilities


$\Pr(m_1 = i_1, m_2 = i_2, m_3 = i_3, m_4 = i_4) = \prod_{h=1}^{4} \binom{N_h}{i_h} \Big/ \binom{N}{m},$ (3)


where $i_1 + i_2 + i_3 + i_4 = m$ and $0 \le i_h \le m$ for h = 1, ..., 4.

(3) In each stratum h, select a simple random sample of
size $m_h$, without replacement, from the $n_h$ sample
units.

The conditional probability of selecting the sample $S_m$,
given that it is contained in the stratified sample, is then


$\frac{\prod_{h=1}^{4} \binom{N_h}{x_h}}{\binom{N}{m}} \times \prod_{h=1}^{4} \binom{n_h}{x_h}^{-1}.$ (4)


The probability of selecting any given sample $S_m$ using the
inverse algorithm is the product of the two probabilities given
in equations (2) and (4). It is straightforward to show that this
product is equal to


$1 \Big/ \binom{N}{m}.$

Therefore this procedure reproduces a simple random 
sampling mechanism unconditionally, i.e., when taken over 
all possible stratified samples. Note that in order to generate 
all possible SRS’s from this population, the entire sequence 
must be repeated, starting with selecting a stratified sample 
and proceeding through steps 1 - 3. 
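
The three steps can be sketched in Python as follows. This is a minimal illustration rather than production code: the function names, the list-of-lists representation of the stratified sample, and the device of drawing the hypergeometric counts by sampling stratum labels are all assumptions made for the example. The second function multiplies the containment probability (2) by the conditional probability (4), using exact fractions, so that the product can be compared with $1/\binom{N}{m}$.

```python
import random
from fractions import Fraction
from math import comb


def inverse_stratified_sample(strata_samples, strata_sizes, m, rng=random):
    """Draw m units from a stratified sample following steps (1) to (3).

    strata_samples[h] lists the n_h sampled units of stratum h;
    strata_sizes[h] is the stratum population size N_h.
    """
    # Step 1: m may not exceed the smallest stratum sample size.
    if m > min(len(units) for units in strata_samples):
        raise ValueError("m must not exceed min n_h")
    # Step 2: multivariate hypergeometric counts, obtained by drawing m
    # stratum labels without replacement from N labelled population slots.
    labels = [h for h, N_h in enumerate(strata_sizes) for _ in range(N_h)]
    drawn = rng.sample(labels, m)
    counts = [drawn.count(h) for h in range(len(strata_sizes))]
    # Step 3: SRS without replacement of size m_h within each stratum sample.
    subsample = []
    for units, m_h in zip(strata_samples, counts):
        subsample.extend(rng.sample(units, m_h))
    return subsample


def selection_probability(x, n, N_h):
    """Product of (2) and (4) for a fixed S_m with x[h] units in stratum h."""
    m = sum(x)
    contained = Fraction(1)                          # equation (2)
    conditional = Fraction(1, comb(sum(N_h), m))     # equation (4)
    for x_h, n_h, Nh in zip(x, n, N_h):
        contained *= Fraction(comb(Nh - x_h, n_h - x_h), comb(Nh, n_h))
        conditional *= Fraction(comb(Nh, x_h), comb(n_h, x_h))
    return contained * conditional
```

Building the full label list is convenient only for small N; a dedicated multivariate hypergeometric generator would be the natural replacement for large populations.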


2.4 Inverting a One Stage Cluster Sample 


In this subsection, we consider three special cases. To 
begin with, we examine cluster samples where the clusters are 
of equal size. This is followed by the more usual case where 
the clusters are of unequal size. In both of these settings we
assume the clusters are sampled by a simple random sampling 
mechanism and without replacement. The third case studied 
is that of sampling unequal clusters by a probability 
proportional to size (PPS) mechanism. In this last instance we 
assume that the sampling is with replacement. 


2.4.1 One Stage Cluster Sampling With Equal Cluster 
Sizes, Sampled With Equal Probability 


Assume we have a population of N clusters where all 
clusters are of size M and k of them are selected by a simple 
random sampling mechanism without replacement. 

To construct an inverse algorithm, we need to decide what 
the largest element subsample might be. It is immediate that 
the largest SRS of elements that can be selected is k. 
Incidentally, the cluster size is not a constraint on the size of 
the subsample. 

For a given sample $S_k$, let q denote the number of clusters
represented in $S_k$, $0 < q \le k$. Then the probability that $S_k$ is
contained in the cluster sample is equal to the number of
cluster samples containing these q clusters divided by the total
number of possible cluster samples, i.e.


$\Pr(S_k \subset S_D) = \binom{N-q}{k-q} \Big/ \binom{N}{k}.$ (5)


As for the stratified sample, the algorithm first determines the
number of units to be chosen from each cluster,
$(m_1, m_2, \ldots, m_k)$. The probability distribution to be used to
select the $m_j$'s is


$\Pr(m_1 = i_1, \ldots, m_k = i_k) = \frac{\binom{M}{i_1} \cdots \binom{M}{i_k}}{\binom{NM}{k}} \times \frac{N(N-1)\cdots(N-q+1)}{k(k-1)\cdots(k-q+1)},$ (6)


where $0 \le i_j \le k$, $i_1 + i_2 + \ldots + i_k = k$, and q is the number of
nonzero $i_j$'s. For example, with M = 100, N = 6, and k = 3


$\Pr(m_1 = 1, m_2 = 0, m_3 = 2) = \frac{\binom{100}{1}\binom{100}{0}\binom{100}{2}}{\binom{600}{3}} \times \frac{6 \times 5}{3 \times 2},$


$\Pr(m_1 = 3, m_2 = 0, m_3 = 0) = \frac{\binom{100}{3}\binom{100}{0}\binom{100}{0}}{\binom{600}{3}} \times \frac{6}{3}.$


Once the $m_j$'s are determined, a simple random sample of
size $m_j$ is selected from cluster j, j = 1, 2, ..., k. Therefore the
conditional probability of selecting $S_k$ is


$\Pr(\text{select } S_k \mid S_D) = \prod_{j=1}^{k} \binom{M}{m_j}^{-1} \times \Pr(m_1, \ldots, m_k) = \frac{1}{\binom{NM}{k}} \times \frac{N(N-1)\cdots(N-q+1)}{k(k-1)\cdots(k-q+1)}.$ (7)


k 


The probability of selecting a particular sample S, is found 
by multiplying equation (5) times equation (7). It is routine to 
verify that this gives the correct probability of selecting an 
SRS. 

Unlike the stratified example, where the function for
selecting the values of $m_h$ was a known probability function,
it is not immediately obvious that equation (6) describes a
probability distribution. Since the values generated by this
function are all nonnegative, it need only be shown that they
sum to one over the space of possible values. The first factor
in the equation has the form of a hypergeometric distribution,
except that the numerator is constrained to only k out of the N
clusters, while the denominator still reflects the total N
clusters. It is useful to define a partition of k as a combination
of positive integers that adds to k, without regard to order.
For example, the partitions of k = 3 are {3}, {1,2}, and
{1,1,1}. Because the clusters are all of the same size, M, all
patterns of selection that correspond to the same partition
have the same probability of occurring. Take, for example,
N = 6, and k = 3. In the full hypergeometric distribution, with
equal cluster size, each of the following combinations has the
same probability of occurring


(0,0,0,0,1,2), (0,0,0,0,2,1), (0,0,0,1,2,0), ..., (2,1,0,0,0,0).


The total number of such combinations is
$N(N-1)\cdots(N-q+1)$, where q is the size of the partition,
that is the number of (nonzero) values in the partition. In the
example above, q = 2. For a given partition, if the nonzero
counts can only be put into k specific cells, then there are
$k(k-1)\cdots(k-q+1)$ such orderings. Therefore, summing
the distribution over all values of $(i_1, \ldots, i_k)$ can be done by
first summing over all partitions of k and then for each
partition, summing over all possible orderings of that partition
in k cells. Because all orderings associated with a particular
partition share a common probability of occurrence, this
results in a summation that is equivalent to summing the
hypergeometric over the correct space, and therefore
expression (6) sums to one.
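
The claim that (6) sums to one can be checked numerically for small cases. The Python sketch below assumes (6) has the form of a hypergeometric factor $\prod_j \binom{M}{i_j} / \binom{NM}{k}$ multiplied by $N(N-1)\cdots(N-q+1) / \{k(k-1)\cdots(k-q+1)\}$; the function names are chosen only for this illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb, perm


def cluster_count_probability(counts, M, N):
    """Probability (6) of the count vector (i_1, ..., i_k) for equal cluster size M."""
    k = len(counts)
    q = sum(1 for c in counts if c > 0)      # number of nonzero counts
    hyper = Fraction(1, comb(N * M, k))
    for c in counts:
        hyper *= comb(M, c)
    # perm(N, q) = N(N-1)...(N-q+1) and perm(k, q) = k(k-1)...(k-q+1)
    return hyper * Fraction(perm(N, q), perm(k, q))


def total_probability(M, N, k):
    """Sum of (6) over all count vectors of length k whose entries sum to k."""
    return sum(
        cluster_count_probability(c, M, N)
        for c in product(range(k + 1), repeat=k)
        if sum(c) == k
    )
```

With M = 100, N = 6 and k = 3 the enumeration covers ten count vectors: three with q = 1, six with q = 2 and one with q = 3.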

The probability distribution needed for this simple cluster
design (equation 6) is noticeably more difficult to generate
than the hypergeometric distribution in the case of the
stratified sample. However, as the sampling fraction k/N
decreases, the probability is often contained in only two of
the partitions: q = k and q = k - 1. (These probabilities are
calculated in the Appendix.) Indeed, the probability may be
concentrated in just the pattern with q = k. (A special case of
this is also shown in the Appendix.)

Given the results in the Appendix, it may be possible to 
approximate the exact inverse by selecting one case from each 
cluster, using systematic sampling from the original cluster 
sample. This approach is of real value because the probability 


distribution calculations become unwieldy as the number of 
clusters in the sample grows large. For a systematic inverse to 
work, however, the “step” would naturally have to be at least 
as large as M or maybe even greater, depending on the 
number of clusters in the population. To carry out this 
subsampling repeatedly, for each systematic sample inverse, 
the units within each cluster would be reordered randomly 
before the next selection and the clusters resorted randomly 
as well - then another random start obtained before stepping 
again through the original sample. 


2.4.2 One Stage Cluster Sampling with Unequal 
Clusters, Sampled With Equal Probability 


The inverse sampling algorithm for a sample of clusters of
equal size does not generalize readily when a sample of
unequal sized clusters is drawn. This is so despite the fact that
it would appear to be straightforward to generalize this
approach in an obvious way. In particular, it does not seem
difficult to generalize the previous method so that the
“probabilities” would multiply out successfully to give the
“correct” probability of selection, i.e.


$1 \Big/ \binom{M_0}{k}$, where $M_0 = \sum_{i=1}^{N} M_i.$ (8)


However, generalizing to unequal cluster sizes $M_i$ by
selecting the $m_j$ as


$\Pr(m_1 = i_1, \ldots, m_k = i_k) = \frac{\prod_{j=1}^{k} \binom{M_j}{i_j}}{\binom{M_0}{k}} \times \frac{N(N-1)\cdots(N-q+1)}{k(k-1)\cdots(k-q+1)}$ (9)


does not result in a valid probability distribution. We will
again assume, by the way, that the original clusters are being
sampled with equal probability and without replacement, as
was the case in subsection 2.4.1. Later (Subsection 2.4.3), as
already noted, we will look at original samples which employ
some form of Probability Proportional to Size (PPS)
selection.

To see that it is not straightforward to simply generalize
equation (6) into the form in equation (9), consider the
following counter-example where the “probability” calculated
using (9) is greater than one. Suppose N = 4 with cluster
sizes $M_1 = 4$, $M_2 = 6$, $M_3 = 8$, and $M_4 = 10$. Suppose further
that we draw a cluster sample with k = 2 and that just by
chance the two clusters picked are the largest, i.e., $M_3 = 8$
and $M_4 = 10$. It is immediate that with these selections,
equation (9) would generate a probability of selecting one unit
from each cluster that was greater than one.
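
The counter-example can be worked through directly. The Python sketch below assumes (9) is the natural generalization of (6), with $\binom{M_j}{i_j}$ in place of $\binom{M}{i_j}$ and the population total $M_0 = \sum M_i$ in place of NM; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb, perm


def naive_unequal_probability(counts, sampled_sizes, all_sizes):
    """Value of the naive generalization (9) for one count vector.

    counts[j] is the number of units to take from the j-th sampled cluster,
    sampled_sizes[j] its size, and all_sizes the sizes of all N clusters.
    """
    k = len(counts)
    q = sum(1 for c in counts if c > 0)
    N = len(all_sizes)
    value = Fraction(1, comb(sum(all_sizes), k))
    for c, M_j in zip(counts, sampled_sizes):
        value *= comb(M_j, c)
    return value * Fraction(perm(N, q), perm(k, q))
```

For the clusters of sizes 8 and 10 and the count vector (1, 1), the hypergeometric factor is $8 \times 10 / \binom{28}{2} = 80/378$ and the multiplier is $(4 \times 3)/(2 \times 1) = 6$, so the value is 480/378 = 80/63, which exceeds one.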

Can this difficulty be fixed? Yes, although not perhaps in
an entirely satisfactory way. One method is to employ a
hypergeometric that assumes all the clusters were as large as
the largest cluster in the population. The price paid is that the
inverse sample size achieved is no longer fixed, and the
resulting subsample is only conditionally SRS given the
achieved sample size, denoted, say, as $k_0$. That is, for a given
sample size $k_0$, $k_0 \le k$, all samples of size $k_0$ have the same
probability of being selected using the inverse algorithm.
Let $M^*$ denote the maximum cluster size, $M^* =
\max\{M_1, M_2, \ldots, M_N\}$. Create a population by filling out each
original cluster with “dummy” units or placeholders,
$j = M_i + 1, M_i + 2, \ldots, M^*$. Then using a method similar to
Lahiri's (1951) for PPS sampling, the inverse algorithm
selects units from the population consisting of N clusters each
of size $M^*$, and then discards any element not in the
“subpopulation” consisting of the original clusters of size $M_i$.
Specifically, given a cluster sample consisting of k
clusters, select the vector m from the probability distribution


$\Pr(m_1 = i_1, \ldots, m_k = i_k) = \frac{\prod_{j=1}^{k} \binom{M^*}{i_j}}{\binom{NM^*}{k}} \times \frac{N(N-1)\cdots(N-q+1)}{k(k-1)\cdots(k-q+1)},$ (10)


where the components of m sum to k, and q of the
components $m_i$ are nonzero. This is now a proper probability
distribution. Given the selected value of $m_i$, select a random
sample of $m_i$ units from cluster i, where the cluster contains $M_i$
units from the population and $M^* - M_i$ “placeholders.”
Discard any selected units that are placeholders, in the set
$\{M_i + 1, M_i + 2, \ldots, M^*\}$. Therefore the final sample size
will not necessarily be equal to k, but may be smaller, say $k_0$.

The resulting sample is conditionally a SRS from the 
population, in the sense that for a given value of k,, all 
samples of size k, have the same probability of being selected 
using this inverse algorithm. To see this, continue to view the 
problem as a subpopulation, P, of N clusters of size 
M,,i =1,...,N, within a population P, of N clusters each of 
size M,. Note that for any sample, S,, of size k selected 
from the population P_,, the probability of selecting S, using 
the inverse algorithm is 


NM.) (11) 


If k_0 = k then this is the probability of selecting this sample
using the inverse algorithm. For a fixed k_0 < k, let S_0 denote
any given sample of size k_0 contained in P_0. We can generate
a sample S_* containing S_0 by starting with S_0 and adding to
it k - k_0 elements from the N M_* - M_0 placeholders in P_*.
The number of such samples S_* that result in selecting S_0 is

\[
\binom{N M_* - M_0}{k - k_0}, \qquad \text{where } M_0 = \sum_{i=1}^{N} M_i. \tag{12}
\]

Therefore, the probability of selecting S_0 using the inverse
algorithm is equal to the probability of selecting S_* using the
inverse algorithm, given in (11), summed over all samples S_*
constructed as described above, where the number of such
samples is given by (12). This probability equals

\[
\binom{N M_* - M_0}{k - k_0} \bigg/ \binom{N M_*}{k},
\]

and all samples of size k_0 have the same probability of being
selected using the inverse algorithm.

There is a positive probability, unfortunately, that a sample 
might be selected with this approach that has no elements. 
This could occur if there were a large difference in the cluster 
sizes. However, if the number of clusters k in the original
sample is large, this is unlikely to be a problem. 

Again, as in the case of equal cluster sizes, an approximation
is available using a systematic subsample as an
inverse. This time we would want a step at least as large as the 
maximum cluster size. Using a systematic inverse, by the way, 
would have the advantage of controlling better the actual 
subsample size drawn. 


2.4.3 One Stage Cluster Sampling with Unequal 
Clusters, Sampled With Unequal Probability 


If a sample of k clusters is selected with PPS, an inverse
algorithm may exist. Suppose the samples are selected with
replacement from a population consisting of N clusters, with
unequal cluster sizes, M_1, M_2, ..., M_N. Suppose, further, that
the measure of size is either equal to M_i or proportional to
M_i. Then at each draw,

\[
\Pr(\text{select cluster } j) = \frac{M_j}{M_0},
\qquad \text{where } M_0 = \sum_{i=1}^{N} M_i. \tag{13}
\]

Finally, since a one stage sample is being taken, once cluster
j is selected, then all M_j units from that cluster are included
in the sample.

An inverse algorithm in this case should result in a
SRSWR. That is, for any vector S resulting from k
independent selections from the population, the probability of
selecting the ordered vector is

\[
\Pr(\text{select } S) = \left(\frac{1}{M_0}\right)^{k}. \tag{14}
\]
An inverse algorithm is to simply randomly select one unit
from each cluster in the cluster sample. Because the clusters 
were chosen with replacement, one should think of the 
sampled clusters as being ordered, by the order in which they 
were selected, or in any fixed order. For example, if the 
population contained 20 clusters, a possible cluster sample of 
size k =5 is (7,5, 7, 18, 6), etc. 

The population consists of M_0 units, denoted as
u_1, u_2, ..., u_{M_0}. Let S denote a given sample, with
replacement, S = (s_1, s_2, ..., s_k), and let c = (c_1, c_2, ..., c_k)
denote the associated cluster for each unit. For example, suppose
the population is:


Cluster   Units
1         u_1  u_2  u_3  u_4
2         u_5  u_6  u_7  u_8
3         u_9  u_10 u_11
4         u_12 u_13 u_14
5         u_15 u_16 u_17
6         u_18 u_19 u_20


and k = 3. Then the sample (s_1 = u_1, s_2 = u_4, s_3 = u_17)
corresponds to c = (1, 1, 5). The sample (s_1 = u_18, s_2 = u_20,
s_3 = u_19) corresponds to c = (6, 6, 6). Note that this second
sample can only be selected if cluster 6 is the only cluster chosen
in the cluster sample.

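For a given sample S of size k, and the corresponding
vector c of cluster membership, the unconditional probability
of selecting S using the inverse algorithm is

\[
\Pr(\text{select } S \mid \text{cluster sample } c)\cdot\Pr(\text{select } c)
= \left(\prod_{i=1}^{k}\frac{1}{M_{c_i}}\right)\left(\prod_{i=1}^{k}\frac{M_{c_i}}{M_0}\right)
= \left(\frac{1}{M_0}\right)^{k},
\]

which is equal to the desired probability, equation (14).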

Note that this same inverse algorithm works in the case
where k clusters are selected with ppswr, but a sample of
fixed size m is selected (srswor) from the chosen cluster,
assuming that M_i ≥ m for all clusters i.
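
A short sketch of this inverse follows, using a small population with cluster sizes 4, 4, 3, 3, 3 and 3 as in the six-cluster example above. The function names are hypothetical, and integer labels 1 to 20 stand in for the units. The helper unit_probability works out, in exact fractions, the per-draw chance that the inverse returns a given unit, so the claim that every unit has probability 1/M_0 can be checked directly.

```python
import random
from fractions import Fraction

# Hypothetical population: six clusters of sizes 4, 4, 3, 3, 3, 3 (M_0 = 20).
clusters = {1: [1, 2, 3, 4], 2: [5, 6, 7, 8], 3: [9, 10, 11],
            4: [12, 13, 14], 5: [15, 16, 17], 6: [18, 19, 20]}
M0 = sum(len(units) for units in clusters.values())

def draw_ppswr(k, rng):
    """Draw k cluster labels with replacement, probability M_j / M_0 each."""
    labels = list(clusters)
    sizes = [len(clusters[j]) for j in labels]
    return rng.choices(labels, weights=sizes, k=k)

def inverse_ppswr(cluster_draws, rng):
    """Inverse step: one unit chosen at random from each drawn cluster."""
    return [rng.choice(clusters[j]) for j in cluster_draws]

def unit_probability(unit):
    """Exact per-draw probability that the inverse returns `unit`."""
    for units in clusters.values():
        if unit in units:
            # Pr(cluster drawn) * Pr(unit picked within the cluster)
            return Fraction(len(units), M0) * Fraction(1, len(units))
    raise ValueError("unit not in population")

rng = random.Random(0)
c = draw_ppswr(3, rng)       # ordered cluster sample
s = inverse_ppswr(c, rng)    # ordered with-replacement sample of units
```

Because the k draws are independent, the probability of any ordered sample of k units is the product of the per-draw probabilities, which is the form required by (14).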


2.4.4 Some Comments On One Stage Designs. 


We have seen that, with care, inverse algorithms can be 
constructed for several special cases where the original 
sample has a one stage cluster design. Two of our results are 
for cluster samples drawn with equal probability without 
replacement. The third is a ppswr design. 

A convenient systematic inverse may even be workable as
an approximation to the correct inverse algorithm when we
have a cluster sample. The approximation works when using
SRSWR is “close to” SRSWOR, i.e., in our notation, when k/NM
is very small so that 1/(NM - k + 1) is approximately equal to
1/NM. So everything seems intuitively to be consistent
across the cases studied.


Many cluster designs do not fall into any of the special 
cases examined. For some of them we conjecture that exact 
inverse algorithms may not exist. In particular, the general 
case of PPSWOR sampling seems to be one of these, 
including the frequently used variant of systematic PPSWOR. 
This may or may not be a problem for practitioners, who often
employ the (usually) conservative practice of assuming that
the sampling was with replacement; in that case an inverse
algorithm would exist to the same order of approximation as
was being assumed to estimate variances.


2.5 Multistage Cluster Designs 


What about multistage designs? Can they be inverted? In
some cases, we believe the answer is “Yes.” Three designs
will be looked at: (1) a two-stage design with simple random
sampling at the first and second stages (Subsection 2.5.1);
(2) a design that employs probability proportional to
size (PPS) sampling at the first stage and simple random
sampling at the second (Subsection 2.5.2); and (3) the
very important stratified multistage design with two PSUs per
stratum, which deserves at least a brief comment
(Subsection 2.5.3).

As will be seen, the stratified and one stage results extend 
fairly readily. To demonstrate this, our basic strategy is to 
repeatedly apply the approaches already discussed earlier. 


2.5.1 Multistage Designs With Simple Random 
Sampling at Both Stages 


Suppose, first, that originally a simple random sample of
k clusters, all of size M, was drawn at the first stage and a
simple random subsample of size r was drawn at the second
stage, within each cluster selected at the first stage.

As earlier, our inverse sample can be no larger than k.
Suppose first that 1/(NM - k + 1) is approximately equal to
1/NM; then we can employ an SRSWR inverse algorithm, since
SRSWR and SRSWOR are very close. Using the results in
Subsection 2.4.3, we would take a SRSWR sample of k
clusters and then within each selected cluster take one
observation at random. Alternatively, we could, as in
Subsection 2.4.1, first determine the number of units to be
chosen from each cluster, (m_1, m_2, ..., m_k). Once the m_i’s are
determined, a simple random sample without replacement of
size m_i is selected from cluster i, i = 1, 2, ..., k. This may be
a nearly exact result, except for the possibility that the inverse
second stage sample size m_i may be larger than the original
second stage sample size r. When this occurs, we still can
appeal to the results in Subsection 2.4.2 and draw our second
stage sample with “placeholders.” In this second instance, the
resulting actual sample size would no longer be fixed, but the
sample would still be conditionally SRS. If the first stage
clusters are unequal in size but sampled with replacement, then
we can again employ the trick used in Subsection 2.4.2 of creating
“placeholders.” The sample sizes are random and only
conditionally do we achieve an SRS inverse.

Another way to approach this problem is to note that the
largest SRS that can be selected using an inverse algorithm is
of size k_1 = min{k, r}. This is done by first determining the
number of units to select from each cluster, (m_1, m_2, ..., m_k),
where now the m_i’s must sum to k_1 rather than k. Once the
m_i’s are determined, a simple random sample of size m_i is
selected from cluster i, i = 1, 2, ..., k. The probability
distribution to be used to select the m_i’s is

\[
\Pr(m_1 = i_1, \ldots, m_k = i_k)
= \frac{\binom{M}{i_1}\cdots\binom{M}{i_k}}{\binom{NM}{k_1}}
\cdot \frac{N(N-1)\cdots(N-q+1)}{k(k-1)\cdots(k-q+1)},
\]

where 0 ≤ i_j ≤ k_1, i_1 + i_2 + ... + i_k = k_1, and q is the
number of nonzero i_j’s.

One final comment: for both equal and unequal cluster
sizes, an approximate systematic inverse seems available,
with essentially the same caveats, of course, as noted above.


2.5.2 Multistage Designs With PPS Sampling at the 
First Stage and SRS Sampling at the Second 


Again, our inverse sample can be no larger than k. It is 
immediate that one way to construct an inverse would be to 
use the results in Subsection 2.4.3. Specifically, we would 
take a srswr sample of k clusters and then within each selected
cluster take one observation at random. Other inverse 
algorithms may exist too. A systematic inverse seems 
reasonable, provided the probability of selecting the same 
cluster more than once is small to vanishing. 


2.5.3 Stratified Multistage Designs With Two PSU’s 
Per Stratum 


Can designs with two Primary Sampling Units (PSUs) per
stratum be inverted? Our answer is “Yes,” if the within stratum
selections are made in one of the ways we discussed in detail
earlier. This is basically the only case we will cover.

From our results in Subsections 2.3 and 2.4, it is
immediate that if an inverse is to exist, then the sample size m
cannot be any larger than m = 2. Depending on the sampling
within each stratum, we could employ one or more of the exact
or approximate inverses to obtain two SRS selections within
each stratum. To obtain an overall SRS sample, we would
employ the inverse algorithm of Subsection 2.3 on these two
selections and end up, finally, with just two selections overall.


2.5.4 Some Comments On Multistage Designs 


In this Subsection, we have quickly covered a few 
multistage designs and provided exact or approximate 
inverses. The results were derived by appealing to earlier 
results in Subsections 2.3 and mainly 2.4. Of course, many 
multistage designs do not fall into any of the special cases 
examined - notably those with systematic selections at the last 
stage. 
One last observation, many readers may wonder, at this
point, how a method that selects only a sample of size two (as 
we did in Subsection 2.5.3) can be of any practical value. 
Perhaps the next section will help. 


3. RESAMPLING TO INCREASE POWER 


3.1 General Setting 


Drawing a single, smaller simple random sample from a 
larger, more complex sample might be adequate for some 
users in some settings. However, for most users, the loss in 
power between the estimate based on the complex sample and 
the estimate based on a simple random sample would not be 
acceptable. 

In order to increase the power of our approach, it was 
natural to consider resampling techniques. We are limited in 
the size of the SRS that can be drawn, but we can repeat the 
process. By repeating the entire subsampling procedure, we 
can generate g simple random samples each of size m, where 
each SRS is selected independently from the overall original 
sample. Each repetition must include all steps of the 
subsampling procedure. For example, in the stratified case, 
the stratum subsample sizes must be redrawn using the 
hypergeometric distribution. 
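
The repetition can be sketched as follows for the stratified case. The sketch assumes, as one reading of the stratified inverse referred to above, that the stratum subsample sizes follow a multivariate hypergeometric distribution with the stratum population sizes N_h as group sizes, and that m does not exceed the smallest stratum sample size. The function names and argument layout are hypothetical.

```python
import bisect
import random
from itertools import accumulate

def draw_stratum_sizes(pop_sizes, m, rng):
    """Multivariate hypergeometric allocation of m units to the strata."""
    bounds = list(accumulate(pop_sizes))        # cumulative N_h
    picks = rng.sample(range(bounds[-1]), m)    # SRSWOR of population slots
    sizes = [0] * len(pop_sizes)
    for p in picks:
        sizes[bisect.bisect_right(bounds, p)] += 1
    return sizes

def one_inverse_srs(strat_sample, pop_sizes, m, rng):
    """One full pass: redraw the stratum sizes, then subsample each stratum."""
    sizes = draw_stratum_sizes(pop_sizes, m, rng)
    srs = []
    for units, m_h in zip(strat_sample, sizes):
        srs.extend(rng.sample(units, m_h))      # SRSWOR within the stratum
    return srs

def resample(strat_sample, pop_sizes, m, g, rng):
    """Return g conditionally independent simple random samples of size m."""
    if m > min(len(units) for units in strat_sample):
        raise ValueError("m exceeds the smallest stratum sample size")
    return [one_inverse_srs(strat_sample, pop_sizes, m, rng) for _ in range(g)]
```

Note that the stratum sizes are redrawn inside every pass rather than once up front, which is the point made in the paragraph above.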

In this section, conditions are given under which the 
precision of the estimates using multiple SRSs can be made 
arbitrarily close to the precision of the original estimates. We 
will begin our discussion by first defining some notation. 

Let D denote any invertible design (such as a design of the
type covered in Section 2). Let T be the population quantity
of interest (say, a population total); and let T_D be an unbiased
estimator of T calculated from the sample S_D. Suppose g
simple random samples are independently drawn from the
given sample S_D, and let t_i denote the estimator from the i-th
simple random sample. Then it can be shown that

if E(t_i | S_D) = T_D, then

\[
\operatorname{Var}\left(\frac{1}{g}\sum_{i=1}^{g} t_i\right)
= \operatorname{Var}(T_D) + \frac{1}{g}\left(\operatorname{Var}(t_1) - \operatorname{Var}(T_D)\right).
\]

Proof: Because the g replications of the simple random
sampling process are conditionally independent, then

\[
\text{for } i \ne j, \qquad E(t_i t_j \mid S_D) = T_D^2.
\]

Therefore, unconditionally, for i not equal to j,

\[
\operatorname{Cov}(t_i, t_j) = E(t_i t_j) - T^2 = \operatorname{Var}(T_D).
\]

And the result follows directly.
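
For readers who want the omitted step, the lines below spell out how the covariance leads to the stated variance. They use only the quantities defined above, together with the fact that the t_i are identically distributed, so that every Var(t_i) equals Var(t_1).

```latex
\begin{align*}
\operatorname{Var}\Big(\frac{1}{g}\sum_{i=1}^{g} t_i\Big)
 &= \frac{1}{g^2}\Big[\sum_{i=1}^{g}\operatorname{Var}(t_i)
    + \sum_{i \ne j}\operatorname{Cov}(t_i, t_j)\Big] \\
 &= \frac{1}{g^2}\Big[g\,\operatorname{Var}(t_1)
    + g(g-1)\operatorname{Var}(T_D)\Big] \\
 &= \operatorname{Var}(T_D)
    + \frac{1}{g}\big(\operatorname{Var}(t_1) - \operatorname{Var}(T_D)\big).
\end{align*}
```

As g grows, the second term vanishes and the variance approaches that of the original design estimator.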

Some of the conditions in this proof can be relaxed; if T_D
is biased, then similar results can be obtained for MSE instead
of variance. However, the condition that E(t_i | S_D) = T_D
is necessary. And this condition is not met for ratio
estimators. But, if the condition is met separately for the 
numerator and for the denominator of the ratio estimate and 
if the final size of the combined sample is sufficiently large so 
that a Taylor Series approximation is acceptable, then similar 
results can be found for approximations to the variance for 
ratios in the usual manner. Incidentally, even in the two PSU 
per stratum design, this approach works — provided we can 
obtain an unbiased estimate from each individual sample of 
size 2. And for estimates of totals, this can be the case — 
assuming at each stage of sampling that an inverse can be 
constructed. 


3.2 Estimating The Sampling Error for Means 
or Totals 


By resampling, one can achieve almost the same precision
as the original design estimator. But because the resampled
SRSs are only conditionally independent, the estimation of the
standard error is not as simple as if only one SRS had been drawn.
However, the estimation remains relatively straightforward.

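Let S² denote the population variance for the variable X
and let T be its population total. For the sample means, totals
and variances calculated from the generated simple random
samples, let

\[
\bar{x}_j = \frac{1}{m}\sum_{i=1}^{m} x_{ij}, \qquad
t_j = N\bar{x}_j, \qquad
s_j^2 = \frac{1}{m-1}\sum_{i=1}^{m}(x_{ij} - \bar{x}_j)^2, \qquad
t_{\cdot\cdot} = \frac{1}{g}\sum_{j=1}^{g} t_j = N\bar{x}_{\cdot\cdot},
\]

\[
\text{where } \bar{x}_{\cdot\cdot} = \frac{1}{mg}\sum_{j=1}^{g}\sum_{i=1}^{m} x_{ij}.
\]

Note that the sample variance using all gm units can be
expressed as

\[
s^2 = \frac{1}{mg-1}\left[(m-1)\sum_{j=1}^{g} s_j^2
+ \frac{m}{N^2}\sum_{j=1}^{g}(t_j - T)^2
- \frac{mg}{N^2}(t_{\cdot\cdot} - T)^2\right].
\]

Hence

\[
E(s^2) = \frac{1}{mg-1}\left[g(m-1)S^2
+ \frac{m}{N^2}\sum_{j=1}^{g}\operatorname{Var}(t_j)
- \frac{mg}{N^2}\operatorname{Var}(t_{\cdot\cdot})\right].
\]

Rewriting this gives

\[
\operatorname{Var}(t_{\cdot\cdot}) = \frac{m-1}{m}N^2 S^2
+ \frac{1}{g}\sum_{j=1}^{g}\operatorname{Var}(t_j)
- \frac{mg-1}{mg}N^2 E(s^2).
\]

Therefore, by replacing S² and Var(t_j) with unbiased
estimates and replacing E(s²) with s², we can generate
approximately unbiased estimates of Var(t..).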

It may be worth emphasizing that this result does not 
require the user to know anything about the original sample 
design. If users are given a way to invert the original design, 
then they can, by repeated subsampling, achieve nearly the 
efficiency of the original design and readily estimate the 
appropriate sampling errors. There is one condition on this
result, namely that the subsample size be such that m ≥ 2.
Incidentally, for m = 2, the variance expression becomes

\[
\operatorname{Var}(t_{\cdot\cdot}) = \frac{N^2}{2}S^2
+ \frac{1}{g}\sum_{j=1}^{g}\operatorname{Var}(t_j)
- \frac{2g-1}{2g}N^2 E(s^2).
\]


Based on this, as above, a variance estimator could be built 
for two PSU per stratum designs. 
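
A sketch of how such an estimator might be computed from g subsamples of common size m is given below. It rests on several assumptions that are not spelled out in the text: t_j is taken to be the expansion total N times the subsample mean, Var(t_j) is estimated by the usual SRSWOR expression N²(1 - m/N)s_j²/m, S² is estimated by the average of the s_j², and E(s²) is replaced by the variance s² of all mg values pooled together. The function name is hypothetical.

```python
from statistics import mean, variance

def resampled_total_variance(samples, N):
    """Estimate Var(t..) from g simple random subsamples of common size m.

    samples: list of g lists of observed values; N: population size.
    Returns (t_bar, v): the averaged total and its variance estimate.
    """
    g = len(samples)
    m = len(samples[0])
    if m < 2 or any(len(s) != m for s in samples):
        raise ValueError("need g subsamples of a common size m >= 2")
    s2_j = [variance(s) for s in samples]        # within-subsample variances
    t_j = [N * mean(s) for s in samples]         # expansion totals
    t_bar = mean(t_j)
    pooled = variance([x for s in samples for x in s])  # s^2 over all mg units
    S2_hat = mean(s2_j)                          # assumed estimate of S^2
    var_tj_hat = [N * N * (1 - m / N) * s2 / m for s2 in s2_j]
    v = ((m - 1) / m * N * N * S2_hat
         + mean(var_tj_hat)
         - (m * g - 1) / (m * g) * N * N * pooled)
    return t_bar, v
```

With g = 1 the first and last terms cancel and the expression collapses to the familiar SRSWOR variance estimate, which gives a convenient sanity check.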


3.3 An SOI Illustration 


In this subsection we consider an example of an inverse
algorithm and how well it works. The Statistics of Income
(SOI) corporate sample will be our starting point. Now, as
noted earlier, the SOI sample has essentially a stratified SRS
design and so can be inverted (Subsection 2.2).

It is our belief that many SOI users might find a full SRS
inverse sample more valuable and easier to employ than the
complete, stratified sample data base. An interim goal could
be to provide them with a set of simple random samples. A
more flexible system would be to provide interactive
software that allows the user to designate the simple random
samples of interest, to be selected from the complete data
base.

In our simulations we used four of the strata in the SOI
sample of corporate returns, namely the strata representing the
smallest regular corporations (Hughes et al. 1994). As can be
seen from Table 1, the stratified sample (of four strata)
consisted of 15,618 units, and the largest SRS that can be
selected is m = 2,224. The table also shows the population
sizes and the estimated variance of the variable Total Assets,
within each stratum.


Table 1
Corporate Population and Sample Sizes, plus Estimated
Stratum Variances, for Four SOI Strata


Stratum (h)   N_h         n_h     Estimated variance (in 100's)
1             1,376,801   3,889   222,808
2             552,909     2,224   670,162
3             678,371     4,005   12,796,578
4             436,023     5,500   14,984,753

The variable total assets was used because it is the primary 
stratifying variable; and, therefore, the loss in precision due 
to removing the stratification should be relatively large. 
Indeed, this proved to be the case. 
Shown below is the ratio of the variance of the estimated
total using g simple random samples, of 2,224 each, divided
by the variance of the total based on the stratified sample.
The table displays values of g from 1 to 1,000. For example,
if only one SRS is selected the variance of the estimated total
is 29 times larger than the variance of the stratified total.


g       Relative Variance Increase
1       29.31
2       15.16
10      3.83
100     1.28
500     1.06
1000    1.03

By resampling 500 to 1,000 times, the variance has been 
reduced to the same order of magnitude as the stratified 
sample. Even at 100 subsamples good results exist here, 
suggesting that the use of an inverse algorithm could work 
well for strata such as these. This is not to recommend that an 
inverse algorithm be employed in general with so few 
resamples. Doubtless, in highly skewed populations a much 
larger number would be required. 
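
The pattern in these ratios follows from the result of Subsection 3.1: dividing that variance expression by the variance of the design estimator gives a ratio of 1 + (R_1 - 1)/g, where R_1 is the ratio for a single SRS. The sketch below evaluates this expression. The starting value R_1 = 29.31 is an assumption chosen to agree with the "29 times larger" quoted above and with the remaining tabulated ratios, and the function name is hypothetical.

```python
def relative_variance(g, r1):
    """Variance ratio to the design estimator when g SRSs are averaged."""
    return 1.0 + (r1 - 1.0) / g

R1 = 29.31  # assumed ratio for a single SRS
for g in (1, 2, 10, 100, 500, 1000):
    print(g, round(relative_variance(g, R1), 2))
```

The excess over 1 shrinks in proportion to 1/g, so halving the remaining loss in precision requires doubling the number of subsamples.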


4. POTENTIAL APPLICATIONS 
AND NEXT STEPS 


In this paper we have shown that inverse sample design 
algorithms exist in a few special cases. We do not, as yet, 
have a general result — if, indeed, there is one. This is clearly 
a part of the problem that needs more work. Like most tools, 
an inverse sampling algorithm may not be the best choice in 
certain cases; it may not be even a reasonable alternative in 
some circumstances. But there are applications where it 
appears to have advantages and so should be considered. In 
this section we both briefly suggest areas where this 
methodology may be useful and also mention some of the 
limitations and problems that remain. 


Customer-Driven Perspective — It is worth emphasizing the 
customer-driven nature of our approach. Even if it could not 
be justified on other grounds, inverse algorithms might be 
advocated as a part of “reinvention” (e.g., Osborne and 
Gaebler 1992). Right now many large complex surveys may 
not be sufficiently benefiting society, because they are so 
badly under-analyzed or even misanalyzed: 

– For the long run, we must work towards increasing the
survey and general quantitative literacy of existing and
potential customers, e.g., as with the new series What
Is a Survey? (Scheuren (ed.) 1995).

– In the short run, we need to start where our customers
are, giving due respect to the often small part that
survey data may add to their decision making. Certainly
it is worth thinking about ways to lower the cognitive
costs customers bear when using our complex survey
“products.”
A “Sample” of Possible Opportunities – There is an
increasing awareness of the weaknesses within the
traditional randomization paradigm (e.g., Särndal and
Swensson 1993). Of particular concern here is all the fiddling
we have to do when trying to correct for nonsampling errors.
Some of this flavour is evident in Rao and Shao (1993). By
putting the possible adjustments for these nonsampling errors
back into a simple random sampling framework, we may,
indeed, be able to make more progress.

For decades, survey practitioners have elaborated 
exceedingly complex sample designs; and, then, made 
efficient point and confidence interval estimates from them. 
On the other hand, how much do we really understand about 
the distributions that our sample estimators generate when 
effective sample sizes are small to moderate? Will we be able 
to fully capitalize on the “visualization revolution” now 
occurring (e.g., Cleveland 1993)? Particularly in the presence 
of nonsampling error? Maybe we should be building in a way 
to always look at distributions. The use of an inverse 
sampling algorithm might be one possibility (See also 
Pfeffermann and Nathan 1985). In any case, stronger 
visualization tools for complex surveys could help, even the 
very experienced among us, deepen our intuitions and connect 
them better to the particular population under study. 
Obviously, visualization efforts also pay off by lowering the 
price customers pay to use survey data. 

An intriguing problem where the inverse sampling 
algorithm may have an application is the case where we have 
a two PSU per stratum design with L strata where L is small, 
say less than 30. Suppose further that for some of the 
variables in the survey the stratification and clustering are 
unimportant, i.e., the design effect is δ = 1, approximately.
For these variables, would it not be possible for the stability 
of the variance estimate to be greater with the resampled 
method than with the Balanced Repeated Replication (BRR) 
approach to variance estimation that is usually employed? 

Another example that we are considering is the case where 
the user is interested in tests of independence in 2 x 2 tables, 
based on stratified sample data (Hinkins, Oh and Scheuren 
1995). For the chi-square test statistic we are now in the 
midst of comparing our results with the approach suggested 
by Scheuren (1972) and Fellegi (1980). So far it appears that 
the power of our method is comparable to these more familiar 
approaches (as might be expected from, say, Westfall and 
Young (1993)). This may be an instance where the extra 
work involved in the inverse sampling algorithm may have 
real benefits — beyond just making it easier for users to 
employ familiar tools — by allowing the user to look at the 
distribution rather than just one p-value. 


A “Sample” of Problems Remaining — A “sample” of the 
problems that remain with our inverse algorithm might be 
given here. For example, what happens when we do not know 
what the population size is? What happens when the 
population has more than one elementary unit — persons, say, 
for one analysis; households for another; neighbourhoods for 
still a third? Answers exist for these difficulties but they have 
an ad hoc flavour to us. In many surveys, for instance, we
guess about N and use that guess in poststratification. That 
degree of approximation for an inverse might be acceptable. 
For the problem of multiple analysis units, we could do 
several inverses. While potentially workable, this seems 
exceedingly awkward. 

We have indicated that in some cases it may not be too 
difficult to resample multiple times using the inverse 
algorithm in order to reproduce reasonable efficiency. But
what about the case where the user of a stratified sample is
interested in subpopulations? If the domains of interest are in
fact the strata, then the user does not gain any benefits by 
using the SRS’s produced using the inverse algorithm. If the 
domains of interest cut across the strata and they are small, 
then the number of samples required using the inverse 
algorithm may be very large in order to maintain reasonable 
estimation for the domains. 

Finally, we briefly mention one more problem that we have 
thought about. Many multistage designs actually select only 
one PSU per stratum. The strata are then paired for variance 
estimation purposes. We have already noted that an inverse to 
this approximation is available which can be made about as 
good as that approximation is to begin with. Is there a way to 
get a better approximation using the inverse approach 
directly? 


Last Words — Many things are changing in our profession. 
The worldwide quality revolution certainly has had an impact 
(Mulrow and Scheuren 1996). We are remaking the way 
surveys are done — from design, to data capture, to the way 
customers use them. This paper may be a small contribution 
to that process. 
APPENDIX


Suppose one has a cluster sample of k clusters from a
population of N clusters, where each cluster has the same
number of units, M. In the inverse sampling algorithm, the
first step is to choose the vector (m_1, m_2, ..., m_k) containing
the number of units to be chosen from each cluster. Let q
indicate the number of nonzero values of m_i. The probability
of selecting the one pattern with q = k, that is the pattern with
m_i = 1 for all i = 1, 2, ..., k, is

\[
\Pr(q = k) = M^{k-1}\,
\frac{(N-1)(N-2)\cdots(N-k+1)}{(NM-1)(NM-2)\cdots(NM-k+1)}.
\]

Call this probability P_k. If NM >> k then P_k can be
approximated by

\[
\prod_{i=1}^{k-1}\frac{N-i}{N} = \frac{(N-1)(N-2)\cdots(N-k+1)}{N^{k-1}}.
\]

Consider next the partition of k corresponding to
q = k - 1; this corresponds to exactly one partition of k,
namely {1, 1, ..., 1, 2}. There are k(k - 1) equally likely
possible patterns of (m_1, ..., m_k) with q = k - 1. The
probability of selecting a vector m with q = k - 1 is

\[
\Pr(q = k-1) = \frac{k(k-1)(M-1)}{2M(N-k+1)}\,P_k.
\]

Therefore it is not difficult to calculate the probability that the
selected m has either q = k or q = k - 1. The following table
shows some examples for two values of M.


Table A
Pr(q = k - 1 or q = k)

k     N      M = 10   M = 100
4     8      .92      .90
4     20     .99      .98
10    20     .38      .34
10    30     .63      .59
10    50     .83      .80
10    200    .99      .98
50    500    .35      .30
50    1000   .70      .66
50    5000   .98      .98


For small k, it is not difficult to calculate the entire 
probability distribution needed to generate m. But as k 
increases, the number of partitions increases, and this 
calculation becomes difficult or at least tedious. For k = 4, 
there are only 4 partitions; for k = 10 there are 39 possible 
partitions. One can see from Table A that, as the cluster
sample becomes “larger,” if the sampling rate is small
enough, i.e., if k << N, then one might only need to calculate
the probabilities for these two partitions in order to
approximately invert the cluster sample. For k = 10 and
N = 200, these two partitions essentially account for all of the 
probability distribution. 
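
The two leading probabilities are easy to evaluate directly. The sketch below codes the expressions for Pr(q = k) and Pr(q = k - 1) given earlier in this Appendix. The function names are hypothetical, and exact rational arithmetic is used only to sidestep rounding concerns for large k.

```python
from fractions import Fraction

def prob_all_ones(k, N, M):
    """P_k = Pr(q = k): every sampled cluster contributes exactly one unit."""
    p = Fraction(M) ** (k - 1)
    for i in range(1, k):
        p *= Fraction(N - i, N * M - i)
    return p

def prob_one_double(k, N, M):
    """Pr(q = k - 1): one cluster gives two units and one gives none."""
    factor = Fraction(k * (k - 1) * (M - 1), 2 * M * (N - k + 1))
    return factor * prob_all_ones(k, N, M)

def prob_top_two(k, N, M):
    """Pr(q = k or q = k - 1), the quantity tabulated in Table A."""
    return float(prob_all_ones(k, N, M) + prob_one_double(k, N, M))

for k, N in [(4, 8), (4, 20), (10, 30), (50, 1000), (50, 5000)]:
    print(k, N, round(prob_top_two(k, N, 10), 2),
          round(prob_top_two(k, N, 100), 2))
```

When the printed values fall well below 1, the remaining partitions carry appreciable probability and would also have to be enumerated for an exact inverse.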

The probability of selecting just one unit per cluster
(q = k) is smaller than the values in Table A; so, in order to
use a systematic inverse, we would want k <<< N. This can
be obtained in some settings when the number of clusters is
large and we are willing to take k very small, relying on
repeatedly resampling the original survey, as described in
Section 3.

To illustrate, assume a sample of size k_1, where, of course,
k_1 ≤ k, so that an inverse is possible. Further, to see if a
systematic inverse would work, let k_1 <<< N. This is the
case we illustrate in Table B. In Table B, we have confined
attention to just one value of N, N = 5000 clusters, although
the results could be extended readily.


Table B
Pr{inverse sample picks the pattern (1, 1, ..., 1)}

k_1   k_1/N   M = 10   M = 100
2     .0004   .9998    .9998
5     .001    .9982    .9980
10    .002    .9919    .9911
20    .004    .9663    .9627
30    .006    .9245    .9166
40    .008    .8687    .8553
50    .01     .8015    .7821


Clearly, as k_1/N gets small, a systematic sample becomes a
better and better approximate inverse. Only experience would
confirm if the approximation at k_1 = 20 and k_1/N = .004, say,
is adequate. We think it might be, especially since the effect
of using a systematic inverse usually is to make the variance
calculations more conservative (since typically the intracluster
correlation ρ > 0).
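
The M = 10 column of Table B can be checked with a few lines of code. This sketch evaluates the probability that all k_1 subsampled units fall in distinct clusters, using the expression for Pr(q = k) from the start of the Appendix with k_1 in place of k, and sets it beside the large-NM approximation given there. The names p_exact and p_approx are hypothetical.

```python
import math

def p_exact(k1, N, M):
    """Pr(all k1 subsampled units fall in distinct clusters)."""
    log_p = sum(math.log(M * (N - i)) - math.log(N * M - i)
                for i in range(1, k1))
    return math.exp(log_p)

def p_approx(k1, N):
    """Approximation for NM much larger than k1: product of (N - i) / N."""
    return math.prod((N - i) / N for i in range(1, k1))

N = 5000
for k1 in (2, 5, 10, 20, 30, 40, 50):
    print(k1, k1 / N, round(p_exact(k1, N, 10), 4), round(p_approx(k1, N), 4))
```

The approximation ignores M, so it sits slightly below the exact value for small clusters and moves closer to it as M grows.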
Variable Selection for Regression Estimation
in Finite Populations


PEDRO L.D. NASCIMENTO SILVA and CHRIS J. SKINNER


ABSTRACT 


The selection of auxiliary variables is considered for regression estimation in finite populations under a simple random 
sampling design. This problem is a basic one for model-based and model-assisted survey sampling approaches and is of 
practical importance when the number of variables available is large. An approach is developed in which a mean squared
error estimator is minimised. This approach is compared to alternative approaches using a fixed set of auxiliary variables, 
a conventional significance test criterion, a condition number reduction approach and a ridge regression approach. The 
proposed approach is found to perform well in terms of efficiency. It is noted that the variable selection approach affects 
the properties of standard variance estimators and thus leads to a problem of variance estimation. 


KEY WORDS: Auxiliary information; Calibration; Sample surveys; Subset selection; Ridge regression. 


1. INTRODUCTION 


Regression estimation is widely used in sample surveys for
incorporating auxiliary population information (Cochran
1977, chap. 7). For the basic case when the population mean X̄
of a vector of variables x_i is known and simple random
sampling is used, the regression estimator of the population
mean Ȳ of a survey variable y_i takes the form

\[
\bar{y}_r = \bar{y} + (\bar{X} - \bar{x})'b \tag{1}
\]

where ȳ and x̄ are the sample means of y_i and x_i,
respectively, and b is the sample vector of linear regression
coefficients of y_i on x_i.

Regression estimation is useful for at least three reasons. 
First, it is flexible. Any number of population means of 
continuous or binary variables can, in principle, be 
incorporated into X̄. In particular, poststratification arises as
a special case (Särndal, Swensson and Wretman 1992, sec.
7.6). The procedure also extends to handle complex sampling
designs. Second, regression estimation has certain optimal
efficiency properties. See, for example, Isaki and Fuller
(1982, Theorem 3). Third, ȳ_r has the “calibration” property
that if y_i is one of the variables of x_i, so that Ȳ is known, then
ȳ_r = Ȳ (Deville and Särndal 1992).

In this paper we consider the question of how to select the 
x variables for use in the regression estimator. This question 
is of interest for at least two reasons. First, there is simply the 
practical reason that in some circumstances the number of 
potential variables in x_i may be very large. For example, in
population censuses in a number of countries values of some
variables are recorded on a “short form” for all individuals
and values of other variables are collected on a “long form”
for a sample. The population means of the short form
variables together with their squares, cubes, products and so
forth will thus be known. Small area identification will also
typically be available. Thus the dimension of x_i as a vector
containing functions of the short form variables together with 
dummy variables representing each small area could easily 
run into the thousands. In such cases, the selection of x 
variables becomes a practical necessity. 

A second reason is more fundamental for a model-assisted 
or model-based approach to survey sampling. These 
approaches may be characterised as follows in the context of 
regression estimation. First a regression model is selected 
which has “good predictive power”, so that the regression
estimator will have “good efficiency”. Then, either a
design-based approach to inference is adopted in the model-assisted
approach (Särndal et al. 1992) or model-based prediction is
employed in the model-based approach. Although the
literature on the latter problem of inference is vast, there
seems remarkably little formal attention devoted to the former
model selection problem. In practice, the most that seems to
happen is that the “main” x variables which account for “most
of” the sample R² are chosen (cf. Särndal et al. 1992,
sec. 7.9.1). However, more theoretical guidance seems 
needed, especially when a large number of x variables is 
available. 

A further reason for considering the variable selection 
problem more formally is that it may help clarify the issue of 
the impact of variable selection on inference. The problem 
that sample-based selection of estimators may affect the 
properties of the selected estimator has long been recognized 
(Hansen and Tepping 1969, App.) but little study seems to 
have been made of what the effects may be. 

In this paper we consider a variable selection approach
aimed at minimising the mean squared error of $\bar{y}_r$. First,
however, we study the dependence of the mean squared error
of $\bar{y}_r$ on the number of x variables in section 2 and then
consider alternative estimators of the mean squared error of $\bar{y}_r$
in section 3. Variable selection procedures based on these
estimators are then proposed in section 4.

We contrast our variable selection approach with four 
existing approaches. First, we consider the traditional 
approach of using a fixed subset of auxiliary variables 
regardless of the observed sample. Next, we consider a 
“condition number reduction procedure” inspired by work of 
Bankier (1990), in which auxiliary variables are discarded in 
order to reduce the condition number of a certain cross- 
products matrix of the x variables. 

Third, we follow Bardsley and Chambers (1984) and 
consider a ridge regression approach. This does not involve 
variable selection but instead addresses the possible problem 
of multicollinearity in the regression estimator by modifying 
the estimator, allowing for some calibration error. Both the 
ridge regression and condition number reduction procedures 
have the advantage that they do not require specification of a 
response variable y, because they aim to provide a single set 
of “calibration” weights to be used for all survey variables. 
However, they do not guarantee gains in efficiency. Their 
results are separated by a line from the results for the other 
procedures in the tables presented in section 6 to indicate that 
they differ. 

Fourth, we consider variable selection following 
conventional significance test criteria. Our general view is 
that the objective of variable selection in regression 
estimation for finite populations is quite different from the 
objective of parameter estimation or prediction of y values for 
single observations in classical regression (Miller 1990). 
However, it seems desirable to treat such an approach as one 
benchmark for comparison. 

In section 5 we consider properties of the regression 
estimator following variable selection on the basis of 
estimated variances. Section 6 describes an empirical study 
carried out to compare our proposed variable selection 
procedures with the competing procedures described above. 
This study used data from a test census carried out in the 
municipality of Limeira, Brasil, as part of the preparation for 
the 1991 Brazilian Population Census. Section 7 presents our 
conclusions and some directions for further research. 


2. THE DEPENDENCE OF THE VARIANCE OF 
THE REGRESSION ESTIMATOR ON THE 
NUMBER OF x VARIABLES 


We begin by defining some notation. Let $U = \{1, \ldots, N\}$
denote a finite population of N distinguishable elements and
let $s \subset U$ denote a sample of n distinct elements drawn from
U according to a simple random sampling without
replacement design. Let $x_i = (x_{i1}, \ldots, x_{iq})'$ be the $q \times 1$ vector
of auxiliary variables associated with the i-th population
element. It is assumed that the sample values of $x_i$ ($i \in s$),
together with the population mean vector $\bar{X} = N^{-1}\sum_{i \in U} x_i$, are
known. The vector of sample means is denoted $\bar{x} = n^{-1}\sum_{i \in s} x_i$.

Let $y_i$ denote the value of a survey variable y for the i-th
population element and suppose the values of $y_i$ are only
observed for $i \in s$. The aim is to estimate the population mean
$\bar{Y} = N^{-1}\sum_{i \in U} y_i$.

The regression estimator of $\bar{Y}$ is given by $\bar{y}_r$ in equation (1),
where $\bar{y} = n^{-1}\sum_{i \in s} y_i$, $b = S_x^{-1}s_{xy}$,
$S_x = n^{-1}\sum_{i \in s}(x_i - \bar{x})(x_i - \bar{x})'$,
and $s_{xy} = n^{-1}\sum_{i \in s}(x_i - \bar{x})(y_i - \bar{y})$.

This estimator may be motivated by the underlying linear 
model 


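$$y_i = \beta_0 + x_i'\beta + \varepsilon_i \qquad (2)$$


where the $\varepsilon_i$ are independent disturbances with zero means
and common variance $\sigma^2$, since we may write $\bar{y}_r = \hat{\beta}_0 + \bar{X}'\hat{\beta}$,
where $\hat{\beta}_0 = \bar{y} - \bar{x}'b$ and $\hat{\beta} = b$ are the least squares estimators
of $\beta_0$ and $\beta$, respectively. Under this model the variance of
$\bar{y}_r - \bar{Y}$ conditional on the $x_i$ may be written


$$\mathrm{Var}_M(\bar{y}_r - \bar{Y} \mid x_i) = \sigma^2 n^{-1}[1 - n/N + (\bar{X} - \bar{x})'S_x^{-1}(\bar{X} - \bar{x})]. \qquad (3)$$


The final term may be interpreted as the effect of
estimating $\beta$ by b. As the number q of x variables increases
the residual variance $\sigma^2$ may be expected to decrease, but the
term $(\bar{X} - \bar{x})'S_x^{-1}(\bar{X} - \bar{x})$ may increase as $S_x^{-1}$ becomes more
unstable. An alternative way to interpret this term is to write $\bar{y}_r$
as a weighted estimator $\bar{y}_r = n^{-1}\sum_{i \in s} g_i y_i$, where
$g_i = 1 + (\bar{X} - \bar{x})'S_x^{-1}(x_i - \bar{x})$. Then we may write (3) alternatively as


$$\mathrm{Var}_M(\bar{y}_r - \bar{Y} \mid x_i) = \sigma^2 n^{-1}(1 - n/N + c_g^2) \qquad (4)$$


where $c_g$ is the sample coefficient of variation of the $g_i$.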
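
As an illustration only, the following Python sketch computes the regression estimator, the weights $g_i$ and $c_g^2$ for one simple random sample. It assumes that equation (1) has the usual form $\bar{y}_r = \bar{y} + (\bar{X} - \bar{x})'b$ and that $S_x$ and $s_{xy}$ use the divisor n; the function name and the simulated data are hypothetical and are not part of the study reported here.

```python
import numpy as np

def regression_estimator(y, x, x_pop_mean):
    """Regression estimator of the population mean, its g-weights and c_g^2.

    y: (n,) sample responses; x: (n, q) sample auxiliary values;
    x_pop_mean: (q,) known population means of the auxiliary variables.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    x = np.asarray(x, dtype=float).reshape(n, -1)
    xbar = x.mean(axis=0)
    dx = x - xbar
    S_x = dx.T @ dx / n                         # divisor n (assumed)
    s_xy = dx.T @ (y - y.mean()) / n
    b = np.linalg.solve(S_x, s_xy)              # least squares slopes
    diff = np.asarray(x_pop_mean, dtype=float) - xbar
    y_r = y.mean() + diff @ b                   # assumed form of equation (1)
    g = 1.0 + dx @ np.linalg.solve(S_x, diff)   # weights g_i
    c_g2 = g.var()                              # mean of g is 1, so CV^2 equals the variance
    return y_r, g, c_g2

# Hypothetical population and sample, for illustration only.
rng = np.random.default_rng(0)
N, n, q = 426, 100, 3
X = rng.normal(size=(N, q))
Y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=N)
s = rng.choice(N, size=n, replace=False)
y_r, g, c_g2 = regression_estimator(Y[s], X[s], X.mean(axis=0))
```

In this sketch the weighted form $n^{-1}\sum_{i \in s} g_i y_i$ reproduces `y_r`, and `c_g2` equals the quadratic form $(\bar{X} - \bar{x})'S_x^{-1}(\bar{X} - \bar{x})$, which is the link between (3) and (4).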

To study the expected dependence of $c_g^2$ on q we now
extend the model by supposing that the $x_i$ are independently
and identically normally distributed. Noting the independence
of $(\bar{x} - \bar{X})$ and $S_x$ and also that $E_M(\bar{y}_r - \bar{Y} \mid x_i) = 0$, we obtain
the unconditional variance


$$\mathrm{Var}_M(\bar{y}_r - \bar{Y}) = \sigma^2 n^{-1}\{1 - n/N + \mathrm{tr}[E_M\{(\bar{X} - \bar{x})(\bar{X} - \bar{x})'\}E_M(S_x^{-1})]\}$$
$$= \sigma^2 n^{-1}(1 - n/N)[1 + q/(n - q - 2)] \qquad (5)$$


using the fact that $n^{-1}S_x^{-1}$ has an inverse Wishart distribution
(Mardia, Kent and Bibby 1979, p. 69 and 85). This result
holds for large n even without normality, in the sense that
$\{1 - n/N + c_g^2\}/\{(1 - n/N)[1 + q/(n - q - 2)]\}$ still converges to
1 as n increases for fixed q (under weak conditions).
Expression (5) makes the dependence on q explicit. As q
increases we may expect $\sigma^2$ to decrease but $E_M(c_g^2)$ to increase.
The reduction of $\sigma^2$ may be expected to be small after a few
important x variables are included and thus the variance may be
expected to start increasing at some point where the number of
x variables is a non-negligible fraction of the sample size.
Results (4) and (5) are based on strong modelling
assumptions and hence provide us only with motivation. In
the general case $\bar{x} - \bar{X} = O_p(n^{-1/2})$ (under the randomization
distribution with standard regularity conditions) so that the
last term of (3) is of $O_p(n^{-2})$. A more general second order
asymptotic approximation for the design mean squared error
of $\bar{y}_r$ when model (2) need not hold may be obtained by
generalising Theorem 4.1 of Deng and Wu (1987). Details are
given in Silva (1996).

Our aim is to develop a variable selection procedure that
minimizes the estimated mean squared error of $\bar{y}_r$, and
estimators of this mean squared error are considered next.


3. ESTIMATION OF THE MEAN SQUARED 
ERROR OF THE MULTIPLE REGRESSION 
ESTIMATOR 


A simple estimator of the mean squared error of $\bar{y}_r$ is
obtained by generalizing expression (7.29) of Cochran (1977,
p. 195) to the case of several auxiliary variables:


$$v_s = \frac{1 - f}{n}s_e^2 \qquad (6)$$

where $s_e^2 = (n - q - 1)^{-1}\sum_{i \in s} e_i^2$ and $e_i = (y_i - \bar{y}) - (x_i - \bar{x})'b$.


This estimator makes no allowance for the $O(n^{-2})$
component of the mean squared error, however. Thus, as a
second mean squared error estimator, we generalize the 
estimator v,, studied in Deng and Wu (1987) to the case of 
general q. This is a special case of the model-based, bias- 
robust variance estimator G, originally proposed by Royall 
and Cumberland (1978), for the case where the residual 
variances in the model (2) are constant. This estimator is 
given by 

pee et ale yorere? (7) 


n(n fa 1) i€s 


where 
a, =(g; - 2g,f+f{(1 -f U1 - @,- x)'S,'@,- XM - 1))}. 


We originally conjectured that $v_d$ would be second order
unbiased, as Deng and Wu (1987, eq. 4.4) show that it is for
the case of q = 1. However this turns out not to be the case
for general q > 1, although it may be expected that the bias of
$v_d$ is smaller than that of $v_s$, as indicated by the second order
bias expressions for $v_s$ and $v_d$ obtained by Silva (1996).

A difficulty with $v_d$ as a variance estimator is that it does
not generalize easily to complex survey designs. Thus we
consider as a third variance estimator a modified version of an
estimator proposed by Särndal, Swensson and Wretman
(1989), defined as:


$$v_g = \frac{1 - f}{n(n - q - 1)}\sum_{i \in s} g_i^2 e_i^2 \qquad (8)$$


This estimator may be expected to behave similarly to $v_d$ since
$a_i = g_i^2 + O_p(n^{-1/2})$. In the terminology of Särndal et al. (1992,
p. 232), the $g_i$ are the appropriate g-weights under simple
random sampling if (2) is adopted as the underlying model.
Expression (8) differs from the corresponding estimator
proposed by Särndal et al. (1989, example 4.4) in that we use
the denominator $(n - q - 1)$ instead of the original $(n - 1)$.
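
A minimal Python sketch of two of these estimators follows. It reads (6) as $v_s = (1 - f)s_e^2/n$ and (8) as $v_g = (1 - f)\sum_{i \in s} g_i^2 e_i^2/\{n(n - q - 1)\}$, and it assumes that f denotes the sampling fraction n/N; the function name is hypothetical, and the second estimator (7) is not included.

```python
import numpy as np

def mse_estimators(y, x, x_pop_mean, N):
    """Return (v_s, v_g) for the regression estimator under simple random sampling."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    x = np.asarray(x, dtype=float).reshape(n, -1)
    q = x.shape[1]
    f = n / N                                    # sampling fraction (assumed meaning of f)
    xbar = x.mean(axis=0)
    dx = x - xbar
    S_x = dx.T @ dx / n
    slopes = np.linalg.solve(S_x, dx.T @ (y - y.mean()) / n)
    e = (y - y.mean()) - dx @ slopes             # residuals e_i
    diff = np.asarray(x_pop_mean, dtype=float) - xbar
    g = 1.0 + dx @ np.linalg.solve(S_x, diff)    # g-weights
    s_e2 = np.sum(e ** 2) / (n - q - 1)
    v_s = (1.0 - f) * s_e2 / n                                      # equation (6) as read here
    v_g = (1.0 - f) * np.sum(g ** 2 * e ** 2) / (n * (n - q - 1))   # equation (8) as read here
    return v_s, v_g
```

When the sample means of the auxiliary variables happen to equal the population means, every $g_i$ equals 1 and the two estimates coincide; they differ only through the spread of the g-weights.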


4. VARIABLE SELECTION PROCEDURES 


We consider two basic variable selection procedures. First,
an all subsets approach that involves computing one of the
mean squared error estimators $v_s$, $v_d$ or $v_g$ of section 3 for all $2^q$
possible subsets of the q auxiliary variables (always including
the intercept) and choosing that subset corresponding to the
smallest mean squared error estimate. This procedure can
clearly involve considerable computation if q is large. Thus as
a second procedure, we consider a forward selection
approach which starts with the sample mean as an estimator,
then adds that variable which minimizes the mean squared
error estimate. The procedure is repeated until the mean
squared error estimate starts to increase, at which point the
subset of variables which gave the minimum mean squared
error estimate is selected.
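
The forward selection procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the study: the criterion shown is the simple estimator $v_s$ read as $(1 - f)s_e^2/n$ with f = n/N, and the function names and simulated data are hypothetical.

```python
import numpy as np

def v_s(y, x, N):
    """Simple MSE estimator for the regression on the columns of x (with intercept)."""
    n, q = x.shape
    design = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    e = y - design @ beta                        # least squares residuals
    return (1.0 - n / N) * np.sum(e ** 2) / (n * (n - q - 1))

def forward_select(y, x, N, criterion=v_s):
    """Greedy forward selection that minimises an estimated mean squared error."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    q = x.shape[1]
    selected = []
    # Start from the sample mean, i.e. no auxiliary variables.
    best = criterion(y, x[:, np.array(selected, dtype=int)], N)
    while len(selected) < q:
        candidates = [j for j in range(q) if j not in selected]
        scores = [criterion(y, x[:, selected + [j]], N) for j in candidates]
        k = int(np.argmin(scores))
        if scores[k] >= best:                    # the estimate stops decreasing
            break
        best = scores[k]
        selected.append(candidates[k])
    return selected, best

# Hypothetical sample: only the first column is related to y.
rng = np.random.default_rng(42)
N, n = 426, 100
x = rng.normal(size=(n, 6))
y = 10.0 + 3.0 * x[:, 0] + rng.normal(size=n)
selected, best = forward_select(y, x, N)
```

Any of the other mean squared error estimators could be passed as `criterion`, provided it accepts the same arguments. The all subsets approach would replace the greedy loop by an enumeration of the $2^q$ subsets.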

These procedures may be contrasted with an approach
inspired by the work of Bankier and his associates; see
Bankier (1990) and Bankier, Rathwell and Majkowski (1992).
We call this a condition number reduction approach. To
describe the approach, first note that the regression estimator
in (1) can alternatively be expressed as


$$\bar{y}_r = [n\bar{y} + (N\bar{X}^* - n\bar{x}^*)'(X_s'X_s)^{-1}X_s'y_s]/N \qquad (9)$$


where $X_s$ is the $n \times (q+1)$ matrix with $x_i^{*\prime} = (1, x_{i1}, \ldots, x_{iq}) = (1, x_i')$
as its i-th row, $\bar{x}^* = (1 : \bar{x}')'$ and
$\bar{X}^* = (1 : \bar{X}')'$ are the sample and population mean vectors of
$x_i^*$ respectively, and $y_s$ is the $n \times 1$ vector with the sample
observations of the response.

The regression estimator thus depends on the inversion of
the cross-products matrix $X_s'X_s$, a matrix which can
sometimes become ill-conditioned and thereby inflate the
variance of the regression estimator.

Bankier (1990) proposed a two-step procedure for
computing regression estimators of means (or totals) in which
columns of the auxiliary data matrix $X_s$ were eliminated in
order to reduce the condition number of the cross-products
matrix $X_s'X_s$ as well as to avoid undesirable situations
(negative or outlying weights, rare characteristics, or exact
linear dependence between columns). Bankier et al. (1992)
describe in detail the procedure as applied to the 1991
Canadian Population Census. It is worth noting that the
approach developed by Bankier and associates, although
incorporating variable selection, is not targeted at achieving
efficiency for a particular survey variable. Its main focus is on
calibration, while at the same time providing a single set of
weights that are used for all survey variables.

The condition number reduction approach that we consider
can be described by the algorithm below, which adopts a
backward elimination procedure to discard auxiliary variables
generating large condition numbers for the cross-products
matrix $CP = X_s'X_s$, instead of the forward inclusion of
variables described by Bankier et al. (1992).

1) Compute the cross-products matrix $CP = X_s'X_s$
considering all the columns initially available
(saturated subset).


2) Compute the Hermite canonical form of CP, say H (see
Rao 1973, p. 18), and check for singularity by looking
at the diagonal elements of H. Any zero diagonal
elements in H indicate that the corresponding columns
of $X_s'X_s$ (and $X_s$) are linearly dependent on other
columns (see Rao 1973, p. 27). Each of these columns
is eliminated by deleting the corresponding rows and
columns from $X_s'X_s$.


3) After removing any linearly dependent columns, the
condition number $c = \lambda_{\max}/\lambda_{\min}$ of the reduced CP
matrix is computed, where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest
and smallest of the eigenvalues of CP, respectively. If
$c < L$, a specified value, stop and use all the auxiliary
variables remaining.


4) Otherwise perform backward elimination as follows.
For every k, drop the k-th row and column from CP,
and recompute the eigenvalues and the condition
number of the reduced matrix. Compute the condition
number reductions $r_k = c - c_k$, where $c_k$ is the
condition number after dropping the k-th row and
column from CP. Determine $r_{\max} = \max_k(r_k)$ and
$k_{\max} = \{k : r_k = r_{\max}\}$ and eliminate the column $k_{\max}$ by
deleting the $k_{\max}$-th row and column from CP. Make
$c = c_{k_{\max}}$ and iterate while $c > L$ and $q \geq 2$, starting each
new iteration with the reduced CP matrix resulting
from the previous one.
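
The four steps above can be sketched in Python as follows. Two points are assumptions of this sketch and not part of the algorithm as stated: the Hermite canonical form check of step 2 is replaced by a sequential rank check, and the intercept column (taken to be the first column of $X_s$) is protected from elimination by default. The function names and the simulated data are hypothetical.

```python
import numpy as np

def condition_number(cp):
    """Ratio of the largest to the smallest eigenvalue of a symmetric matrix."""
    eig = np.linalg.eigvalsh(cp)                 # eigenvalues in ascending order
    return eig[-1] / eig[0]

def condition_number_reduction(x_s, L=1000.0, protect_intercept=True):
    """Backward elimination of columns of x_s until the condition number is at most L."""
    x_s = np.asarray(x_s, dtype=float)
    keep = []
    # Step 2 (stand-in): keep a column only if it is not linearly dependent
    # on the columns already kept.
    for j in range(x_s.shape[1]):
        if np.linalg.matrix_rank(x_s[:, keep + [j]]) == len(keep) + 1:
            keep.append(j)
    # Steps 1 and 3: cross-products matrix and its condition number.
    cp = x_s[:, keep].T @ x_s[:, keep]
    c = condition_number(cp)
    first = 1 if protect_intercept else 0
    # Step 4: drop the column giving the largest reduction r_k = c - c_k.
    while c > L and len(keep) - 1 >= 2:
        trials = []
        for pos in range(first, len(keep)):
            idx = [i for i in range(len(keep)) if i != pos]
            trials.append((condition_number(cp[np.ix_(idx, idx)]), pos))
        c, pos = min(trials)                     # smallest c_k is the largest r_k
        keep.pop(pos)
        cp = x_s[:, keep].T @ x_s[:, keep]
    return keep, float(c)

# Hypothetical sample: intercept, one large-scale variable, two indicators,
# and a column that is linearly dependent on the intercept and d2.
rng = np.random.default_rng(7)
n = 50
income = rng.normal(1000.0, 300.0, size=n)
d2 = rng.integers(0, 2, size=n).astype(float)
d3 = rng.integers(0, 2, size=n).astype(float)
x_s = np.column_stack([np.ones(n), income, d2, d3, 1.0 - d2])
keep, c = condition_number_reduction(x_s, L=1000.0)
```

In this hypothetical example the dependent column is removed by the rank check and the large-scale variable is the first one eliminated by the backward step, because it dominates the largest eigenvalue of CP.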

One further approach that we consider is the ridge
regression estimator of Bardsley and Chambers (1984). It
does not rely on selecting subsets from the auxiliary variables
available, but rather on relaxing the calibration properties of
the regression estimator in favour of more stable estimates.
The ridge regression estimator is given by


$$\bar{y}_{RC} = [n\bar{y} + (N\bar{X}^* - n\bar{x}^*)'(\lambda C^{-1} + X_s'X_s)^{-1}X_s'y_s]/N \qquad (10)$$


where $\lambda$ is a scalar ridging parameter and C is a diagonal
matrix of “cost” coefficients associated with the calibration
errors tolerated when estimating totals of the auxiliary
variables using $\bar{y}_{RC}$.

Bardsley and Chambers (1984) suggested that the
specification of the matrix C could be used to control the
influence of each auxiliary variable on the resulting estimator
of the response mean, thus imitating the subset selection
process. As for the ridging parameter $\lambda$, they suggested
taking the smallest value such that all the implicit case
weights are not smaller than 1/N (or 1 for estimating totals).
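
The implicit case weights of the ridge estimator can be illustrated with the Python sketch below. It assumes that the penalty enters (10) as $\lambda C^{-1}$ with C diagonal; when all cost coefficients equal 1 this assumption has no effect. The function names, the grid search for $\lambda$ and the simulated data are hypothetical.

```python
import numpy as np

def ridge_weights(x, x_pop_mean, N, lam, cost=None):
    """Case weights w_i such that the ridge estimator of the mean is sum_i w_i * y_i."""
    x = np.asarray(x, dtype=float)
    n, q = x.shape
    x_s = np.column_stack([np.ones(n), x])       # rows are (1, x_i')
    if cost is None:
        c_inv = np.eye(q + 1)                    # all cost coefficients equal to 1
    else:
        c_inv = np.diag(1.0 / np.asarray(cost, dtype=float))
    # N * (population mean vector) minus n * (sample mean vector), both with leading 1.
    target = N * np.concatenate([[1.0], x_pop_mean]) - x_s.sum(axis=0)
    a = np.linalg.solve(lam * c_inv + x_s.T @ x_s, target)
    return (1.0 + x_s @ a) / N

def smallest_lambda(x, x_pop_mean, N, grid):
    """First value on an increasing grid for which all weights are at least 1/N."""
    for lam in sorted(grid):
        if np.all(ridge_weights(x, x_pop_mean, N, lam) >= 1.0 / N):
            return lam
    return None

# Hypothetical population and sample.
rng = np.random.default_rng(3)
N, n, q = 426, 100, 4
X = rng.normal(size=(N, q))
s = rng.choice(N, size=n, replace=False)
w0 = ridge_weights(X[s], X.mean(axis=0), N, lam=0.0)    # ordinary regression weights
w1 = ridge_weights(X[s], X.mean(axis=0), N, lam=50.0)   # calibration relaxed
```

With `lam=0.0` the weights reproduce the known population means exactly; with a positive value they no longer do, and in particular they need not sum to 1.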


5. PROPERTIES OF REGRESSION ESTIMATORS 
AFTER VARIABLE SELECTION 


For our basic variable selection procedures, a set of
estimation strategies $S = \{(\bar{y}_r^{\gamma}, v^{\gamma});\ \gamma \in \Gamma\}$ is considered, where
$\bar{y}_r^{\gamma}$ and $v^{\gamma}$ are the regression estimator and an estimator of its
variance respectively for a subset $\gamma$ of the q auxiliary variables
available, and $\Gamma$ is the set of all subsets. The variable selection
procedure selects a subset $\gamma^*$ from $\Gamma$ according to a rule
which is determined by the data and by S, and the resulting
point estimator is $\bar{y}_r^{\gamma^*}$.

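For each fixed subset $\gamma$, it follows under standard
regularity conditions (Isaki and Fuller 1982) that $\bar{y}_r^{\gamma}$ is
consistent for the population mean $\bar{Y}$, that is $\bar{y}_r^{\gamma} - \bar{Y} = o_p(1)$.
Now, for given $\delta > 0$, $|\bar{y}_r^{\gamma^*} - \bar{Y}| > \delta$ implies $|\bar{y}_r^{\gamma} - \bar{Y}| > \delta$ for
some $\gamma$, and so we have


$$\Pr(|\bar{y}_r^{\gamma^*} - \bar{Y}| > \delta) \leq \sum_{\gamma \in \Gamma}\Pr(|\bar{y}_r^{\gamma} - \bar{Y}| > \delta) \qquad (11)$$


and because $\Gamma$ is finite, the right hand side of (11) converges
to zero, and it follows that $\bar{y}_r^{\gamma^*}$ is also consistent.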

The distribution of $\bar{y}_r^{\gamma^*}$ will, however, depend on the
selection rule in a complex way. See Grimes and Sukhatme
(1980) for an investigation of the efficiency of $\bar{y}_r^{\gamma^*}$ in the
simplest case when there are just two possible estimators: a
regression estimator with one x variable and a difference
estimator (a special case of which is the mean) and the
variables are jointly normally distributed.

In contrast to the consistency of $\bar{y}_r^{\gamma^*}$, there is no reason
why $v^{\gamma^*}$ should be consistent for $\mathrm{Var}(\bar{y}_r^{\gamma^*})$, even if $v^{\gamma}$ is
consistent for $\mathrm{Var}(\bar{y}_r^{\gamma})$ for each fixed $\gamma$. In particular we may
expect $v^{\gamma^*}$ to underestimate $\mathrm{Var}(\bar{y}_r^{\gamma^*})$ if the selection rule is
such that $v^{\gamma^*}$ is the minimum of the $v^{\gamma}$. This effect is similar
to the well known overestimation of $R^2$ after subset selection
in standard multiple linear regression (Miller 1990, p. 7-10).


6. A SIMULATION STUDY


In this section we present a small simulation study carried 
out to evaluate the performance of the alternative variable 
selection procedures considered. We took as our simulation 
population a data set comprising 426 records for heads of 
household surveyed using the sample (long) questionnaire 
during the 1988 Test Population Census of Limeira, in Sao 
Paulo state, Brasil. 

This test was carried out as a pilot survey during the 
preparation for the 1991 Brazilian Population Census. The 
test consisted of two rounds of data collection. In the first 
round, each enumerator would visit all the occupied 
households in a given enumeration area (an area with between 
200 and 300 households on average) and would fill in a short 
questionnaire. This form contained a few questions about 
characteristics of the household and about each member of 
the household (sex, age, relationship to head of household
and literacy). For heads of household only, a question on
education and another about monthly total income were also 
included. The reported monthly total income for heads of 
household provides only a proxy to the actual income, due to 
the limitations of the interviewing process in this first round 
of data collection. 

Then a second round of data collection was undertaken in 
each enumeration area. The same enumerators would visit a 
sample of 1 in 10 of the households (selected systematically 
from the list of occupied households compiled in the first 
round of data collection) to obtain information using a long 
(more detailed) questionnaire, which contained all the 
questions asked in the short form plus many other questions. 

The size of the surveyed population was approximately 
44,000 households with 188,000 individuals. The sample size 
was roughly 10% of the population size. For reasons of 
computational cost, we used in our simulation study a sub- 
population comprising all the sample records for 426 heads of 
household living in 20 of the 170 enumeration areas. We 
chose these records as our simulation population because they 
contain all the detailed information provided in the sample 
questionnaire, as well as the proxy information available from 
the first round interviews using the short form. 

We considered total monthly income, as obtained from the 
long form, as the main response variable (y) together with 
11 potential auxiliary variables, namely: 


$x_1$ = indicator of sex of head of household equal to male;

$x_2$ = indicator of age of head of household less than or equal
to 35;

$x_3$ = indicator of age of head of household greater than 35
and less than or equal to 55;

$x_4$ = total number of rooms in household;

$x_5$ = total number of bathrooms in household;

$x_6$ = indicator of ownership of household;

$x_7$ = indicator that household type is house;

$x_8$ = indicator of ownership of at least one car in household;

$x_9$ = indicator of ownership of colour TV in household;

$x_{10}$ = years of study of head of household;

$x_{11}$ = proxy of total monthly income of head of household.


From these 11 variables, we constructed two alternative 
sets of auxiliary variables for our simulations. The first set 
was defined by taking five auxiliary variables, namely 
X>-+-,X, and x,,, that have reasonable explanatory power in 
predicting y, especially due to the presence of the proxy
income $x_{11}$. The second set we considered contained ten
auxiliary variables, namely $x_1, \ldots, x_{10}$, which due to the
exclusion of $x_{11}$, has smaller predictive power than the
previous one. For reference, the population correlation matrix
for the survey variable y and the 11 auxiliary variables
is given in Table 3.

We then selected 1,000 samples of size 100 from this 
simulation population by simple random sampling without 
replacement. 




Before proceeding to examine the detailed simulation
results, we first consider the potential for gains from variable
selection following the motivating model-based discussion of
section 2. Recall from equation (4) that under model (2) the
conditional variance of $\bar{y}_r$ is inflated by a term $c_g^2$ because of
estimation of $\beta$. We evaluated the distribution of $c_g^2$ over the
1,000 samples for both the cases of five and ten auxiliary
variables. For the case of five auxiliary variables, the median
value of $c_g^2$ was 0.036, with upper quartile of 0.056 and
maximum 0.255. This accords roughly with equation (5)
which implies that under the model the expected value of $c_g^2$
is $(1 - n/N)q/(n - q - 2) = 0.041$. Note that the wide variation
of $c_g^2$ across samples suggests that it may be sensible to
adopt a procedure which selects a different set of variables for
each sample. The variation of $c_g^2$ is even greater for the case
of ten auxiliary variables, when the median was 0.078, the
upper quartile was 0.107 and the maximum was 0.329, which
also accords roughly with the expected value under the model
of 0.087, according to equation (5). This interpretation
clearly depends on the validity of the model (2), which is
doubtful for these data, but it does suggest that there are
potential efficiency gains to be made from variable selection.
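
The two expected values quoted above follow directly from the expression $(1 - n/N)q/(n - q - 2)$ with n = 100 and N = 426. A short Python check is given below; the function name is hypothetical.

```python
def expected_cg2(n, N, q):
    """Expected value of c_g^2 under the normal model: (1 - n/N) * q / (n - q - 2)."""
    if n - q - 2 <= 0:
        raise ValueError("the expression requires n > q + 2")
    return (1.0 - n / N) * q / (n - q - 2)

for q in (5, 10):
    print(q, round(expected_cg2(100, 426, q), 3))
# prints:
# 5 0.041
# 10 0.087
```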

Another way to assess the potential for efficiency gains
from variable selection is to compute approximations to the
variance of the regression estimator considering various
subsets of the auxiliary variables available, using all the
population records. Figure 1 displays a plot of the
approximation given by a finite population version of
equation (5) computed for increasing subsets of the ten
auxiliary variables, where the variable added at each step is
the one yielding the biggest decrease in the approximation.
The values of the standard first order design-based
approximation $(1 - f)S_e^2/n$ are also plotted for reference,
although as has already been noted, this approximation is
monotone non-increasing when new auxiliary variables are
added. Simulation estimates of the mean squared error for the
regression estimator corresponding to each subset are also
plotted. The plot shows clearly that if a standard regression
estimator with a fixed set of auxiliary variables is to be used,
the subset with five predictors would be the best choice when
the normal approximation for the variance based on
expression (5) was considered, whereas the saturated subset
would be chosen in case the standard design-based
approximation for the variance was considered. The plot also
reveals that the simulation estimates of the mean squared
error agree more closely with the normal model approximation
than with the standard first order approximation,
especially for larger subsets of auxiliary variables. Similar
results are achieved when corresponding variance approximations
are computed given the set of five auxiliary variables.

Figure 1. Finite population approximations and simulation estimates of
the MSE of the regression estimator computed for increasing
subsets of the ten auxiliary variables. Horizontal axis: number of
auxiliary variables included. Plotted series: F = first order,
N = normal model, S = simulation.

Hence both the simulation distributions of $c_g^2$ and the
finite population approximations to the variance of the
regression estimator indicate that there are potential efficiency
gains to be made from variable selection for this population.
To investigate this for our data we now proceed to describe
the details of the simulation study.

For each sample replicate (say s) and for each of the two 
alternative sets of auxiliary variables considered, estimates of 
the population mean of total monthly income were computed, 
as well as corresponding variance estimates, using a number 
of estimation strategies. Each estimation strategy is defined 
as a combination of a subset selection procedure, an estimator 
for the mean and a corresponding variance estimator. The list 
of all strategies considered follows. 


SM) Sample mean estimator, with no auxiliary variables 
(y,v,). This strategy provides the standard against 
which all the others will be compared. 

Fs) Forward selection of auxiliary variables with $(\bar{y}_r, v_s)$.

Fd) Forward selection of auxiliary variables with $(\bar{y}_r, v_d)$.

Fg) Forward selection of auxiliary variables with $(\bar{y}_r, v_g)$.

Bs) Best subset selection from all subsets of auxiliary
variables with $(\bar{y}_r, v_s)$.

Bd) Best subset selection from all subsets of auxiliary
variables with $(\bar{y}_r, v_d)$.

Bg) Best subset selection from all subsets of auxiliary
variables with $(\bar{y}_r, v_g)$.

FI) Fixed subset of auxiliary variables with (7,,v,). 

SS) Saturated subset of auxiliary variables with (y,,v,). 

FR) Forward subset selection using SAS PROC REG, with 
(V,¥,): 

CN) Condition number reduction subset selection procedure 
with (Y,,V,). 

RI) Ridge regression estimator with saturated subset of
auxiliary variables and a variance estimator that we
denote $v_{RC}$, proposed by Dunstan and Chambers
(1986), $(\bar{y}_{RC}, v_{RC})$.


Strategies Fs to Bg are variations of the two procedures we 
proposed for subset selection arising from the use of the three 
mean squared error estimators considered in section 3. 
Strategies FI and SS use the same set of auxiliary variables 
irrespective of the sample selected. In SS the saturated subset 
including all auxiliary variables available is always used. In 
FI a subset was chosen from each of the two sets with five 
(x,,X4,X,,chosen) or ten (x,,X,,X5,Xg,x,,chosen) auxiliary 


variables considered, by applying a standard forward subset 
selection regression procedure to the population dataset. The 
selected subsets were then used for every sample, thus the 
name “fixed subset” strategy for FI. This strategy would not 
be feasible in practice because the population information 
would not be available for the response, but it was considered 
as a theoretical “best possible scenario” under the traditional 
approach. 

For the strategy FR, SAS PROC REG was used “naively” 
to perform a standard forward subset selection for each 
sample. The p-value used to decide whether a new variable 
should be included was the default of the procedure, namely 
0.50. For more details, see SAS (1990, p. 1397). 

For the condition number reduction subset selection
strategy CN, the value used for the parameter L that controls
the method was 1,000. For the ridge regression estimator
strategy RI, the cost coefficients associated with calibration
errors for different variables were all set equal to 1. After
having chosen the value of $\lambda$ that guarantees all the weights
are not less than 1/N, the weights were rescaled such that they
sum to exactly 1, in order to ensure exact calibration when
estimating the population size.

For any estimation strategy, the estimates of the population
mean and its mean squared error for the sample s are denoted
by $\bar{y}(s)$ and $v[\bar{y}(s)]$ respectively. The simulation results for
each estimation strategy were summarised by computing
estimates of the bias, mean squared error (MSE), and average
of mean squared error estimates (AVMSE) from the set of
1,000 sample replicates, given respectively by


$$\mathrm{BIAS} = \sum_s [\bar{y}(s) - \bar{Y}]/1{,}000 \qquad (12)$$
$$\mathrm{MSE} = \sum_s [\bar{y}(s) - \bar{Y}]^2/1{,}000 \qquad (13)$$
$$\mathrm{AVMSE} = \sum_s v[\bar{y}(s)]/1{,}000. \qquad (14)$$


A measure of efficiency was also calculated for each 
strategy by dividing the corresponding simulation mean 
squared error by the simulation mean squared error for the 
sample mean (strategy SM) and multiplying the result by 100. 
Empirical coverage rates for 95% confidence intervals based 
on asymptotic normal theory were also computed for each 
estimation strategy and these rates, expressed as percentages, 
are presented in the last columns of Tables 1 and 2. 
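
The summary measures (12) to (14), the efficiency measure and the empirical coverage can be computed as in the Python sketch below. It assumes that the confidence intervals have the usual form of the estimate plus or minus 1.96 times the square root of the mean squared error estimate; the function names are hypothetical.

```python
import math

def summarise(estimates, mse_estimates, true_mean, z=1.96):
    """Bias, MSE, AVMSE and empirical coverage (%) over the simulation replicates."""
    r = len(estimates)
    bias = sum(est - true_mean for est in estimates) / r
    mse = sum((est - true_mean) ** 2 for est in estimates) / r
    avmse = sum(mse_estimates) / r
    # A replicate is covered when the true mean lies inside estimate +/- z * sqrt(v).
    covered = sum(abs(est - true_mean) <= z * math.sqrt(v)
                  for est, v in zip(estimates, mse_estimates))
    return {"BIAS": bias, "MSE": mse, "AVMSE": avmse,
            "COVERAGE": 100.0 * covered / r}

def efficiency(mse, mse_sample_mean):
    """Simulation MSE of a strategy as a percentage of that of the sample mean."""
    return 100.0 * mse / mse_sample_mean
```

For example, a strategy with simulation mean squared error 233.78 compared with 620.09 for the sample mean has `efficiency(233.78, 620.09)` of about 37.70, so smaller values indicate a more efficient strategy.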

Table 1 displays the simulation results for estimation of the
population mean of the response variable given the set of five
auxiliary variables (which include the proxy income $x_{11}$) with
larger predictive power. In this case, the use of the regression
estimator greatly improves precision for every estimation
strategy employed, except for subset selection using condition
number reduction (CN). The bias was negligible (less than 1%
in terms of the absolute relative bias) for all estimation
strategies (the population mean of y is 194.34) except perhaps
RI, which displayed a slight bias.

Table 1
Bias, Mean Squared Error, Average of Mean Squared Error Estimates, Efficiency and Empirical Coverage of Alternative Estimation
Strategies for the Mean of Response Variable y with Five Auxiliary Variables (Including Proxy Income $x_{11}$) Available

Estimation strategy        BIAS      MSE    AVMSE   Efficiency    Coverage
                                                    over SM (%)   (%)¹
SM) Sample mean            0.25   620.09   619.05     100.00       91.8
Fs) Forward                0.40   233.78   239.62      37.70       82.7
Fd) Forward               -1.25   188.08   196.88      30.33       82.0
Fg) Forward               -1.28   188.38   192.73      30.38       81.1
Bs) Best                   0.44   236.90   239.49      38.20       82.7
Bd) Best                  -1.22   190.52   196.84      30.72       82.0
Bg) Best                  -1.24   190.83   192.71      30.77       81.1
FI) Fixed                  0.29   227.90   241.24      36.75       83.3
SS) Saturated              0.30   233.58   242.32      37.67       82.5
FR) PROC REG               0.38   235.86   240.26      38.04       82.5
--------------------------------------------------------------------------
CN) Cond. num. red.        0.34   507.33   483.63      81.82       89.8
RI) Ridge                    22   304.95   250.07      49.18       82.5

¹ Nominal 95% coverage.


There was no difference between the results for strategies
based on forward selection (Fs-Fg) and corresponding strategies
based on selection from all possible subsets (Bs-Bg).
Hence the faster and cheaper forward selection procedures are
preferable.

Amongst the strategies using forward subset selection, Fd
and Fg (with $v_d$ and $v_g$ as the mean squared error estimators
respectively) yielded greater efficiency, and performed very
similarly. Note also that Fd and Fg performed better than FI
and SS, the strategies that adopted the regression estimator
with a fixed subset of the five auxiliary variables for every
sample. This is true both for the saturated subset (SS) and
when the fixed subset was chosen using information from the
whole population (FI). This shows that one can do better than
the traditional approach of using the regression estimator with
a fixed set of auxiliary variables, by using an adaptive
procedure that chooses the “best” regression estimator
(subset) for each given sample, at least when the target
response variable is the one considered for subset selection.
This property was suggested by the wide variation in the
values of $c_g^2$ between samples, where we may expect to
benefit from a strategy which selects fewer x variables for
samples with the largest values of $c_g^2$.

Comparison with the adaptive strategy FR, which used the
standard subset selection available in PROC REG of SAS,
shows that a criterion using an appropriate estimator of the
mean squared error of the regression estimator makes some
difference. FR yielded similar efficiency to that of traditional
fixed subset strategies (FI-SS).

A more striking result is the low efficiency achieved by the
subset selection procedure based on condition number
reduction (CN) compared to all the other strategies based on
the regression estimator. This was not unexpected, because
that procedure did not take the response variable into account.
This favours the argument that when the mean of some
specified response variable is the main target for inference,
this should be taken into account when selecting the auxiliary
variables to use in connection with the regression estimator.

When the set of five auxiliary variables was considered,
we also observed that, for every sample, the first variable
eliminated to reduce the condition number was proxy income
($x_{11}$). This happened because eigenvalues (and hence condition
numbers) of the CP matrix are dependent on the units of
measurement of the auxiliary variables. Because all other 
auxiliary variables are counts of some kind, proxy income is 
the variable with the largest variance by far. Its exclusion for 
every sample provides some explanation for the poor 
performance of this approach, because it is the best single 
predictor for the response. 

This difficulty was not apparent in Bankier’s work, because 
in the target application of his procedure, the sample data 
from the 1991 Canadian Population Census, all the auxiliary 
variables considered were counts of persons, families or 
households, thus measured in similar units. 

Unlike the eigenvalues of the CP matrix, the regression 
estimator is invariant to location and scale transformation of 
the auxiliary variables. To remove the arbitrary dependence 
of the condition number approach on the units of the auxiliary 
variables, it is therefore natural to standardise these variables 
first and to compute the condition number of the sample 
correlation matrix R, rather than that of the CP matrix. However, 
this was tried and even modest values of L (100) failed to cause 
elimination of any auxiliary variables, which resulted in the 
saturated set being used every time, so that CN reduced to SS. 
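
The dependence of the condition number on the units of measurement can be illustrated with a small sketch. The data below are hypothetical (a count variable and an income-like variable), only two auxiliary variables are used so that the eigenvalues have a closed form, and the function names are illustrative; this is not the procedure used in the simulation study.

```python
import math


def cross_products(x1, x2):
    """Entries (a, b, c) of the 2x2 cross-products matrix [[a, b], [b, c]]."""
    a = sum(u * u for u in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2)
    return a, b, c


def condition_number_2x2(a, b, c):
    """Largest eigenvalue divided by smallest, for symmetric [[a, b], [b, c]]."""
    half_trace = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    largest = half_trace + radius
    smallest = (a * c - b * b) / largest  # determinant / largest eigenvalue
    return largest / smallest


def correlation(x1, x2):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    return s12 / math.sqrt(s11 * s22)


# Hypothetical data: the same income variable expressed in two different units.
counts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
income_thousands = [12.0, 9.0, 15.0, 22.0, 18.0, 30.0]
income_units = [1000.0 * v for v in income_thousands]

cn_thousands = condition_number_2x2(*cross_products(counts, income_thousands))
cn_units = condition_number_2x2(*cross_products(counts, income_units))

# The 2x2 correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r.
r_thousands = correlation(counts, income_thousands)
r_units = correlation(counts, income_units)
cn_corr_thousands = (1 + abs(r_thousands)) / (1 - abs(r_thousands))
cn_corr_units = (1 + abs(r_units)) / (1 - abs(r_units))

print(cn_thousands, cn_units)            # differ by several orders of magnitude
print(cn_corr_thousands, cn_corr_units)  # agree up to rounding
```

In this example the change of units inflates the condition number of the cross-products matrix, while the condition number of the correlation matrix is unaffected.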

The strategy based on the ridge regression estimator (RI) 
performed worse than the saturated subset strategy (SS) in 
terms of efficiency. It also displayed some bias for estimating 
the mean squared error. This loss of efficiency is due to the 
requirement that all the weights should be greater than or 
equal to 1/N, which was imposed only under this strategy. On 
the other hand, it performed much better than the condition 
number reduction strategy CN in terms of efficiency. 

In terms of the empirical coverage rates, only the condition 
number reduction strategy CN performed close to SM 
(sample mean), both leading to modest undercoverage. All the 
other strategies based on regression estimation yielded similar 
coverage rates, well below the target of 95%. 

Results for the simulation carried out with the set of ten 
auxiliary variables ($x_1$ to $x_{10}$) are displayed in Table 2 below. 
As expected, these results show that the strategies that use the 
regression estimator still provide some gain in efficiency over 
the sample mean. However, these gains are not as large as 
those reported in Table 1, when there are five auxiliary 
variables with higher explanatory power. As before, adaptive 
strategies based on forward subset selection performed 
similarly to their counterparts based on best subset selection 
from all possible subsets. Adaptive strategies using $v_d$ or $v_g$ 
as the estimator of the mean squared error were again slightly 
more efficient than the corresponding strategies based on $v_s$, 
although in this case at the expense of larger undercoverage 
of the corresponding nominal 95% confidence intervals. 

The more efficient adaptive estimation strategies (Fd, Fg, 
Bd and Bg) display nonnegligible bias for both the population 
mean and for the mean squared error. In contrast, strategies FI 
and SS present no significant bias for the mean, although 
there is some bias in the mean squared error estimation under 
strategy SS. Note particularly the large negative bias of the 
estimators of the mean squared error, as indicated by the 
differences between the columns labelled MSE and AVMSE 
in Table 2. This appears to be worse for strategies Fd, Fg, Bd 
and Bg, followed by Fs and Bs, and not so bad for SS, FR 
and CN. 


Comparing Fd and Fg with CN, there is a moderate gain in 
efficiency over the condition number reduction procedure, at 
the expense of some increased bias in both the mean and 
mean squared error estimators. Thus, even when the 
predictive power of the available auxiliary variables is not 
large, it is still possible to gain efficiency over strategy CN. 

A bad choice of fixed subset (as for example, the saturated 
subset used in strategy SS) could yield poor results in terms 
of efficiency and also some bias in the mean squared error 
estimation. However, if for example v, was used as the 
estimator for the mean squared error under strategy SS instead 
of v,, there would be no apparent bias (the AVMSE observed 
in that case was 459.67, hence much closer to the estimated 
simulation mean squared error of 462.71). 

The ridge regression estimator was again slightly inferior 
to the saturated subset strategy (SS), but now without any 
apparent bias in estimating the mean or the mean squared 
error. It outperformed the condition number reduction strate- 
gy CN once again in terms of efficiency, albeit by a smaller 
margin. It also performed well in terms of empirical coverage. 

Strategy FR again performed similarly to the fixed subset 
strategies FI and SS, and so was outperformed by strategies 
using a specialized criterion based on an estimator of the mean 
squared error of the regression estimator such as $v_d$ or $v_g$. 

These results suggest that, when estimating the population 
mean of a single response, the proposed adaptive procedures 
combining the regression estimator with some form of subset 
selection based on an appropriate mean squared error 
estimator can offer some useful improvements in efficiency 
over their competitors. However, such strategies may 
introduce some bias when the predictive power of the 
auxiliary variables available is not large, and the 
corresponding MSE estimators may be substantially biased, 
leading to poor coverage. 


Table 2 
Bias, Mean Squared Error, Average of Mean Squared Error Estimates, Efficiency and Empirical Coverage of Alternative Estimation 
Strategies for the Mean of Response Variable y with Ten Auxiliary Variables ($x_1$ to $x_{10}$) Available 

Estimation strategy       BIAS      MSE    AVMSE   Efficiency     Coverage¹
                                                   over SM (%)       (%)
SM) Sample mean           0.25   620.09   619.05     100.00         91.8
Fs) Forward               0.06   468.46   397.99      75.55         86.7
Fd) Forward              -8.12   434.27   338.90      70.03         81.7
Fg) Forward              -7.90   433.71   328.46      69.94         81.6
Bs) Best                 -0.00   466.16   397.59      75.18         86.6
Bd) Best                 -7.90   434.54   336.88      70.08         81.5
Bg) Best                 -7.60   433.26   326.05      69.87         81.6
FI) Fixed                 0.45   490.49   461.86      79.10         89.0
SS) Saturated            -0.20   462.71   413.17      74.62         86.9
FR) PROC REG             -0.07   466.13   399.34      75.17         86.4
CN) Cond. num. red.       3.49   562.91   450.36      90.78         87.3
RI) Ridge                 1.05   480.18   472.82      77.44         89.4

¹ Nominal 95% coverage. 

Table 3 
Correlation Matrix for Variables Used in the Simulation Study with the 1988 Census Population 

Variable     y     x1     x2     x3     x4     x5     x6     x7     x8     x9    x10
x1        0.23
x2       -0.04   0.20
x3        0.17   0.07  -0.40
x4        0.47    [?]    [?]   0.12
x5        0.48   0.09  -0.11   0.15   0.83
x6         [?]  -0.09    [?]   0.03   0.22   0.20
x7         [?]    [?]    [?]    [?]    [?]    [?]   0.16
x8        0.38   0.29   0.07   0.17   0.44   0.41    [?]    [?]
x9        0.20   0.08  -0.06   0.04   0.30   0.25    [?]    [?]   0.37
x10       0.43   0.23   0.33   0.17   0.39    [?]    [?]   0.30   0.49   0.26
x11       0.78    [?]    [?]   0.22   0.54   0.54    [?]    [?]   0.41   0.21   0.49


7. CONCLUSIONS AND FUTURE DIRECTIONS 


Our results suggest that, when using regression estimation, 
there is potential for some gain in efficiency by adopting a 
variable selection procedure based on one of the mean 
squared error estimators $v_d$ or $v_g$. Under SRS, and 
considering the limited simulation evidence, there seems little 
to choose between these two mean squared error estimators. 

Forward subset selection procedures were as effective as 
those based on searches carried out considering all possible 
subsets, which involve much more computation. Our results 
also indicate that it is possible to improve over subset 
selection procedures based on condition number reduction 
whenever a specific response variable is of interest. 

One problem with a variable selection approach is that the 
associated variance estimation is likely to become biased for 
the estimation of the overall mean squared error of the 
regression estimator following variable selection, thus leading 
to poor coverage of standard confidence interval procedures. 
Further research is necessary to investigate possible 
alternative variance estimation procedures. 

This paper has focused on the use of regression estimation 
to reduce sampling variance in the classical sampling context. 
In practice, regression estimation is widely used to correct for 
biases arising from non-sampling errors. In such applications 
the question of how many auxiliary variables to use is also an 
important one. Some variables might be included for reasons 
unrelated to sampling error, for example because they are 
known to be important determinants of nonresponse. 
Nevertheless, as the number of auxiliary variables increases 
the sampling variance may also eventually increase and we 
suggest that a decision rule to limit the number of auxiliary 
variables employed might still usefully be based on sampling 
variance considerations. In the presence of nonsampling bias, 
the difference between $\bar{x}$ and $\bar{X}$ will generally be of $O_p(1)$, 
not $O_p(n^{-1/2})$, and so the results of this paper are not directly 
applicable. Further research is therefore needed to consider 
the extension of our approach to this case. 

Further research is also necessary to extend our approach 
to complex sampling designs. One possible approach for the 
general regression estimators, considered e.g. by Sarndal et al. 
(1992, sec. 6.4), would be to replace the weights g, by the 
“generalized” weights, described by Sarndal et al. (1992, 
eq. 6.5.9), and to base variable selection on the minimization 
of the generalized version of v, given by Sarndal et al. (1992, 
eq. 6.6.4). 
Diagnostics for Formation of Nonresponse Adjustment Cells, 
With an Application to Income Nonresponse in the U.S. 
Consumer Expenditure Survey 


JOHN L. ELTINGE and IBRAHIM S. YANSANEH¹ 


ABSTRACT 


This paper discusses the use of some simple diagnostics to guide the formation of nonresponse adjustment cells. Following 
Little (1986), we consider construction of adjustment cells by grouping sample units according to their estimated response 
probabilities or estimated survey items. Four issues receive principal attention: assessment of the sensitivity of adjusted 
mean estimates to changes in k, the number of cells used; identification of specific cells that require additional refinement; 
comparison of adjusted and unadjusted mean estimates; and comparison of estimation results from estimated-probability 
and estimated-item based cells. The proposed methods are motivated and illustrated with an application involving 
estimation of mean consumer unit income from the U.S. Consumer Expenditure Survey. 


KEY WORDS: Incomplete data; Missing data; Quasi-randomization; Response propensity; Sensitivity analysis; 
Weighting adjustment. 


1. INTRODUCTION 


1.1 Problem Statement 


Survey analysts often use adjustment cell methods to 
account for nonresponse. The main idea is to define groups, 
or “cells”, of sample units which are believed to have approx- 
imately equal response probabilities, or approximately equal 
values of a specific survey item, e.g., income. Weighting 
adjustment or simple hot-deck imputation then is carried out 
separately within each adjustment cell. The resulting adjusted 
estimator of a population mean or total will have a 
nonresponse bias approximately equal to zero, provided the 
within-cell covariances between survey items and response 
probabilities are approximately equal to zero. 

Some previous nonresponse-adjustment work formed 
adjustment cells through combinations of simple demographic 
or geographical classificatory variables. However, Little 
(1986) and others considered formation of cells by direct 
grouping of sample units according to their estimated 
response probabilities or estimated item values. The present 
paper discusses some simple diagnostics that are useful in 
implementing these cell-formation ideas. Principal attention 
is directed to the sensitivity of results to the number of cells 
used; identification of specific cells that require additional 
refinement; comparison of adjusted and unadjusted mean 
estimates; and comparison of estimation results from 
estimated-probability and estimated-item based cells. These 
diagnostics are illustrated with income data collected in the 
U.S. Consumer Expenditure Survey. 


1.2 Notation, Nonresponse Bias, and Adjustment Cells 


Let $U$ be a fixed population of size $N$ with survey items 
$Y_i$, $i \in U$; and consider estimation of the population mean 
$\bar{Y} = N^{-1} \sum_{i \in U} Y_i$. A sample $s$ of size $n$ is selected from $U$, and $\pi_i$ 
is the probability that unit $i$ is included in the sample. 

Nonresponse is assumed to satisfy the following quasi-randomization 
model (Oh and Scheuren 1983). Let $R_i$ be an 
indicator variable equal to 1 if the selected sample unit $i$ is a 
respondent and equal to 0 otherwise. Assume that the $R_i$ are 
mutually independent Bernoulli($\eta_i$) random variables, where 
the fixed response probabilities $\eta_i$ are allowed to differ across 
units. In addition, define the survey weights $\lambda_i = \pi_i^{-1}$ and the 
unadjusted survey-weighted mean response 

$$\hat{\bar{Y}}_1 = \Big(\sum_{i \in s} \lambda_i R_i\Big)^{-1} \sum_{i \in s} \lambda_i R_i Y_i. \qquad (1.1)$$

Because of differences among the $\eta_i$, the unadjusted 
estimator $\hat{\bar{Y}}_1$ has a nonresponse bias approximately equal to 
$(N\bar{\eta})^{-1} \sum_{i \in U} \eta_i (Y_i - \bar{Y})$, where $\bar{\eta} = N^{-1} \sum_{i \in U} \eta_i$ and 
expectations are taken over both the original sample design and the 
quasi-randomization model. To reduce this bias, one often 
partitions the population into $k$ “adjustment cells” $U_h$, 
partitions the sample $s$ into corresponding groups $s_h$, and then 
uses the adjusted estimator 

$$\hat{\bar{Y}}_k = \sum_{h=1}^{k} w_h \hat{\bar{Y}}_{hr}, \qquad (1.2)$$

where $w_h = (\sum_{i \in s} \lambda_i)^{-1} \sum_{i \in s_h} \lambda_i$ and 
$\hat{\bar{Y}}_{hr} = (\sum_{i \in s_h} \lambda_i R_i)^{-1} \sum_{i \in s_h} \lambda_i R_i Y_i$. 
Note that if $k = 1$, then estimators (1.1) and 
(1.2) are identical. For some general discussion of adjustment 
cell methods see, e.g., Cassel, Särndal and Wretman (1983), 
Oh and Scheuren (1983), and Kalton and Maligalig (1991). 
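
As a minimal sketch of the weighting-cell calculation just described, the following computes the unadjusted respondent mean and the cell-adjusted mean from survey weights, response indicators, item values and cell labels. The function names and the tiny data set are hypothetical and serve only to illustrate the arithmetic; variance estimation is not addressed.

```python
def weighted_respondent_mean(weights, responded, y):
    """Survey-weighted mean of y over respondents (responded[i] is 1 or 0)."""
    numerator = sum(w * r * v for w, r, v in zip(weights, responded, y))
    denominator = sum(w * r for w, r in zip(weights, responded))
    return numerator / denominator  # fails if there are no respondents


def adjusted_mean(weights, responded, y, cell):
    """Sum over cells of (cell weight share) x (respondent mean in that cell)."""
    total_weight = sum(weights)
    estimate = 0.0
    for h in sorted(set(cell)):
        members = [i for i, c in enumerate(cell) if c == h]
        share = sum(weights[i] for i in members) / total_weight
        cell_mean = weighted_respondent_mean(
            [weights[i] for i in members],
            [responded[i] for i in members],
            [y[i] for i in members],
        )
        estimate += share * cell_mean
    return estimate


# Hypothetical sample of six units in two cells; y is set to 0.0 for
# nonrespondents as a placeholder, since it is multiplied by responded = 0.
weights = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
responded = [1, 1, 0, 1, 0, 0]
y = [10.0, 12.0, 0.0, 30.0, 0.0, 0.0]
cell = [0, 0, 0, 1, 1, 1]

unadjusted = weighted_respondent_mean(weights, responded, y)
adjusted = adjusted_mean(weights, responded, y, cell)
single_cell = adjusted_mean(weights, responded, y, [0] * len(y))
print(unadjusted, adjusted, single_cell)
```

With a single cell the adjusted and unadjusted estimates coincide; with two cells the low-response cell is weighted back up to its share of the sample.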

¹ John L. Eltinge, Department of Statistics, Texas A&M University, College Station, TX 77843-3143, U.S.A.; Ibrahim S. Yansaneh, Westat, 1650 Research Blvd., Rockville, MD 20850-3195, U.S.A. 

The adjusted estimator $\hat{\bar{Y}}_k$ has remaining nonresponse bias 
approximately equal to 

$$N^{-1} \sum_{h=1}^{k} \sum_{i \in U_h} \bar{\eta}_h^{-1} \eta_i (Y_i - \bar{Y}_h), \qquad (1.3)$$

where $N_h$ is the number of units in $U_h$ and $(\bar{\eta}_h, \bar{Y}_h) = 
N_h^{-1} \sum_{i \in U_h} (\eta_i, Y_i)$. Consequently, one prefers to construct cells 
such that the population covariance between $\eta_i$ and $Y_i$ is 
approximately equal to zero within each cell. In practice, one 
attempts to accomplish this by constructing cells that are 
approximately homogeneous in the response probabilities $\eta_i$ 
or in the items $Y_i$, or both. In some cases, “natural” sets of 
cells are defined a priori through combinations of 
classificatory variables that are available for both respondents 
and nonrespondents. For example, Ezzati and Khare (1992) 
used 72 cells defined by age, race, region, urbanization status, 
and household size to perform nonresponse adjustments for 
part of the National Health and Nutrition Examination 
Survey. In many practical cases, however, the list of 
reasonable candidate variables for cell formation is fairly 
large, and may produce a substantial number of cells that 
contain few, if any, respondents. Consequently, several 
authors have developed methods to screen out the less 
important classificatory variables and to collapse sparse 
adjustment cells in a way that preserves a reasonable degree 
of homogeneity within each of the remaining cells. See, e.g., 
Tremblay (1986); Lepkowski, Kalton and Kasprzyk (1989); 
Kalton and Maligalig (1991); Goksel, Judkins and Mosher 
(1991); and the related discussion of pooling of poststrata in 
Little (1993). In addition, adjustment cell methods are related 
to other methods like regression-based adjustments (e.g., Rao 
1996, Section 2.4 and references cited therein) and general- 
ized raking (Deville, Sarndal and Sautory 1993). 


1.3. Adjustment Cells Based on Estimated Response 
Propensities or Predicted Items 


Adjustment cells are expected to be approximately 
homogeneous, so one may argue that such cells implicitly 
define a model for either the $\eta_i$ or $Y_i$ values, or both. More 
explicit modeling leads to two related cell formation methods. 
First, let $X_i$ be a vector of auxiliary variables observed for 
both responding and nonresponding sample units $i$, and use 
the sample $(R_i, X_i)$ values to fit a model for $\eta_i = \eta(X_i)$ 
through linear, logistic, or probit regression. The sample cells 
$s_h$ are then formed by grouping the sample units according to 
their estimated response probabilities $\hat{\eta}_i$. As a second 
alternative, consider regression of responses $Y_i$ on an 
auxiliary vector $X_i$ to produce estimated items $\hat{Y}_i$ for both 
responding and nonresponding sample units. The sample 
cells $s_h$ are then formed by grouping units according to the 
values $\hat{Y}_i$. 

These two methods were suggested by Little (1986), 
extending the observational-data propensity-score work of 
Rosenbaum and Rubin (1983, 1984). See also David, Little, 
Samuhel and Triest (1983). These ideas were developed 
originally in a model-based context, but extend directly to the 
current framework. Little (1986) argued that use of cells 
based on either the $\hat{\eta}_i$ or $\hat{Y}_i$ values could reduce nonresponse 
bias, and that the $\hat{Y}_i$-based cells could also control variance. 
Also, in some cases the $\hat{\eta}_i$- and $\hat{Y}_i$-based cells can be more 
flexible than cells defined a priori. In addition, the 
$\hat{Y}_i$-based adjustment cells are conceptually related to optimum 
stratification ideas (e.g., Cochran 1977, Sections 5A.7-5A.8). 

Little (1986) did not propose a specific rule to determine 
cell divisions. However, in keeping with related 
observational-data work by Cochran (1968) and by Rosenbaum and 
Rubin (1984), one may consider cell divisions defined by the 
estimated $k^{-1}j$ quantiles of the $\hat{\eta}_i$ or $\hat{Y}_i$ populations, 
$j = 1, 2, \ldots, k-1$. This equal-quantile method gives some 
control over the expected number of respondents in each cell. 
In addition, review of the preceding two references suggests 
that, for a given set of predictors $X_i$, most of the feasible bias 
reduction may be achieved with a relatively small number of 
cells, say k = 5. A case study by Czajka, Hirabayashi, Little 
and Rubin (1992) used k = 6 $\hat{\eta}_i$-based adjustment cells 
within each of several strata, using cell-formation rules that 
were somewhat more complex than the equal-quantile rule 
considered here. However, the potential adequacy of a small 
number of cells should not be over-interpreted. For example, 
if an important regressor is omitted, then the resulting 
cell-based adjusted estimators may retain a substantial amount 
of bias, regardless of the specific number of estimated- 
probability or estimated-item based cells used. 
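
The equal-quantile rule can be sketched as follows. This is an illustrative, unweighted version that cuts at simple order statistics of the scores (estimated response probabilities or predicted items); the function name, the quantile convention and the example scores are assumptions, and a production version would presumably use survey-weighted quantile estimates.

```python
def equal_quantile_cells(scores, k):
    """Assign each unit to one of k cells cut at the j/k sample quantiles.

    Returns (cells, cuts): cells[i] is the cell label 0..k-1 of unit i, and
    cuts holds the k-1 cut points. Unweighted order statistics are used here.
    """
    ordered = sorted(scores)
    n = len(ordered)
    cuts = [ordered[min(n - 1, (j * n) // k)] for j in range(1, k)]
    # A unit's cell label is the number of cut points at or below its score.
    cells = [sum(1 for c in cuts if s >= c) for s in scores]
    return cells, cuts


# Hypothetical estimated response probabilities for ten sample units.
scores = [0.62, 0.91, 0.75, 0.58, 0.83, 0.97, 0.69, 0.88, 0.79, 0.94]
cells, cuts = equal_quantile_cells(scores, k=5)
counts = [cells.count(h) for h in range(5)]
print(cuts)    # four cut points
print(cells)   # cell label for each unit
print(counts)  # number of units per cell
```

With distinct scores and a sample size divisible by k, each cell receives the same number of sample units; ties in the scores can produce unequal cells.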

Finally, an important alternative to weighting adjustment 
is imputation. For example, simple hot-deck imputation 
replaces a missing value within a given adjustment cell by 
randomly selecting respondent donors from the same cell. In 
parallel with (1.1) and (1.2), the resulting mean estimator is 
$\hat{\bar{Y}}_{I} = (\sum_{i \in s} \lambda_i)^{-1} \sum_{i \in s} \lambda_i Y_i^*$, where $Y_i^*$ is either an observed or 
imputed value, as appropriate. Practical applications often 
use weighting adjustment for unit nonresponse and 
imputation for item nonresponse. However, for a given set of 
cells, both the weighting adjustment point estimator (1.2) and 
the imputation estimator $\hat{\bar{Y}}_{I}$ have the same approximate bias 
(1.3). For simplicity, the remainder of this paper will focus 
on weighting adjustment, but one should bear in mind that for 
a given set of cells, the same bias-reduction issues arise 
regardless of whether those cells are used for weighting 
adjustment or simple hot deck imputation. 


1.4 Outline of the Present Paper 


This paper discusses some implementation details of the 
estimated-probability and estimated-item methods of cell 
formation. We devote special attention to diagnostics to 
identify problems in a specific set of cells, and motivate and 
illustrate these diagnostics with an extended example 
involving income nonresponse in the U.S. Consumer 
Expenditure Survey. Section 2 gives some general back- 
ground on this income nonresponse problem. Section 3 
describes and applies several diagnostics, including 
comparison of $\hat{\bar{Y}}_k$ estimates and standard errors for several 
values of $k$ (Section 3.1); partial assessment of within-cell 
bias (Section 3.2.1); assessment of cell widths relative to the 
precision of $\hat{\eta}_i$ estimates (Section 3.2.2); and comparison of 
the adjusted and unadjusted mean estimates $\hat{\bar{Y}}_k$ and $\hat{\bar{Y}}_1$ 
(Section 3.3). Section 4 shows that similar diagnostics can be 
applied to adjustment cells based on predicted incomes $\hat{Y}_i$, 
and also compares the mean income estimates computed from 
estimated-probability and estimated-income based cells. 
Section 5 summarizes the main ideas used in this paper, and 
notes some areas for future research. 


2. INCOME NONRESPONSE IN THE 
U.S. CONSUMER 
EXPENDITURE SURVEY 


2.1 The Consumer Expenditure Survey, Weighting 
Methods and Variance Estimation 


The U.S. Consumer Expenditure Survey (CE) is a 
stratified multistage rotation sample survey conducted by the 
Census Bureau for the Bureau of Labor Statistics. Sample 
elements are “consumer units”, roughly equivalent to 
households. In the interview component of this survey, each 
selected sample unit is asked to participate in five interviews. 
The current CE weighting procedure accounts for initial 
selection probabilities, a noninterview adjustment, post- 
stratification based on several demographic variables, and 
additional refinements; see Zieschang (1990) and United 
States Bureau of Labor Statistics (1992). The complexity of 
the CE weighting work has led the BLS to use variance 
estimators based on pseudo-replication methods with 44 repli- 
cates. This pseudo-replication is approximately equivalent to 
standard balanced repeated replication (Wolter 1985, Ch. 3). 
All standard errors reported here are based on this pseudo- 
replication method, with all additional parameter estimation 
and weighting adjustment steps performed separately within 
each replicate. 


2.2 Income Nonresponse 


The noninterview adjustment in the current CE weighting 
procedure is generally considered to account adequately for 
unit nonresponse, e.g., noncontact or refusal to participate in 
a specific interview. Thus, unit nonresponse in the CE will 
not be considered further here. However, the BLS has had 
concerns about possible bias in mean income estimates due to 
item nonresponse that occurs with income questions in the 
CE; some background is as follows. 

Detailed income data are collected in the second and fifth 
interviews of the CE, and are used to produce estimates of 
mean consumer unit income (U.S. Bureau of Labor Statistics 
1991) and other parameters. CE income data are collected 
through a complex set of questions, and nonresponse rates for 
these questions are relatively high. To provide a summary 
indication of response or nonresponse to the full set of 
income questions, the BLS classifies each second- or 
fifth-interview consumer unit as a complete or incomplete 
reporter of income. The formal definition of “complete 
income reporter” status is fairly complex; Garner and 
Blanciforti (1994) give a detailed discussion. Current BLS 
procedure estimates mean income with the unadjusted mean 
response $\hat{\bar{Y}}_1$ defined by (1.1), with the $R_i$ equal to indicators 
of complete income reporting, $Y_i$ equal to income, and 
weights $\lambda_i$ as described in Section 2.1. The weighted mean $\hat{\bar{Y}}_1$ 
uses both second- and fifth-interview data from a specified 
time period, but does not make direct use of the CE panel-data 
structure. In parallel with this, the present paper will 
distinguish between second- and fifth-interview data only in 
the construction of $\hat{\eta}_i$ and $\hat{Y}_i$ models. 

Here, we used data from the second and fifth interview 
reports from all consumer units that had a second interview 
scheduled during 1990. The second-interview data involved 
5,125 interviewed units and the fifth-interview data involved 
5,093 interviewed units. For each interviewed unit (both the 
complete and the incomplete income reporters), BLS records 
provided a large number of demographic and expenditure 
variables; these were used as auxiliary variables in the 
modeling work described in Sections 3 and 4 below. For both 
the second and the fifth interviews, approximately 14 percent 
of the interviewed consumer units were incomplete income 
reporters. 


3. CELLS BASED ON ESTIMATED RESPONSE 
PROBABILITIES 


We first considered construction of adjustment cells based 
on estimated response probabilities. Logistic regression 
models for the complete-income-reporter probabilities 
$\eta_i = \eta(X_i)$ were fit separately for the second and fifth 
interview data described in Section 2. Model fitting details, 
including model parameter estimates and standard errors, are 
reported in Yansaneh and Eltinge (1993). All variance 
estimates were computed by the pseudo-replication method 
described in Section 2. The final model fits were used to 
estimate complete-reporter probabilities $\hat{\eta}_i$ for each second- 
and fifth-interview unit. Following the strategy in Section 
1.3, units were grouped according to their $\hat{\eta}_i$ values into a 
total of k cells, with cell boundaries defined by the 
equal-quantile method. 


3.1 Initial Sensitivity Analysis for the Number of 
Cells Used 


The first three columns of Table 1 report the adjusted point 
estimates $\hat{\bar{Y}}_k$ of mean income, and associated standard errors, 
for several values of k. Comparisons of these point estimates 
indicate the extent to which the adjusted estimates are 
sensitive to a specific choice of k. For k ≥ 5, the reported 
point estimates are relatively stable, varying between $32,630 
and $32,664. This is consistent with the suggestion in Section 
1.3 that k = 5 cells may provide most of the effective bias 
reduction to be obtained from a given cell-formation method; 
see Rosenbaum and Rubin (1984, Section 1 and Appendix A) 
for some related mathematical background. 

In addition, note that for k ≥ 3, the standard errors of $\hat{\bar{Y}}_k$ 
are also relatively stable, ranging from $508 to $530. This is 
in partial contrast with the general idea that selection of an 
appropriate number of cells hinges on a bias-variance 
trade-off. For the present dataset, it appears that the effective 
bias reduction occurs fairly quickly (at k =5, say), while 
substantial variance inflation does not occur until some point 
beyond k=20. This is not unreasonable, since even for 
k = 20, the number of income responses per cell remained 
fairly large (ranging from 461 to 569), and thus avoided the 
general unstable-estimator problem associated with increasing 
numbers of sparse cells. Conversely, bias-variance tradeoff 
problems may be more severe for moderate k in applications 
involving smaller effective sample sizes, e.g., estimation for 
small subpopulations. 


Table 1 
Adjusted Estimates of Mean Income with Cell Boundaries 
Determined by Estimated Response Probability Quantiles 

Number of cells        Estimate     se     [?]     [?]
Unadjusted (k = 1)       32,967    569     N/A     N/A
k = 3 cells              32,736    530     112    1.30
k = 4 cells                 [?]    518     122    1.28
k = 5 cells              32,630    523     138     [?]
k = 6 cells              32,664    515     122     [?]
k = 10 cells             32,640    514     116    1.58
k = 15 cells             32,638    515     118    1.58
k = 20 cells             32,634    508     118    1.63


3.2 Two Simple Cell Diagnostics 


To complement the preceding sensitivity analysis, it is 
useful to study some sets of adjustment cells in additional 
detail. Let $C_1 = \{s_1, \ldots, s_k\}$ be a given candidate set of 
adjustment cells, e.g., the k = 3 or k = 5 equal-quantile-division 
cells in Section 3.1. The cells in $C_1$ can be refined by using 
equal-quantile divisions with a larger value of k; or by directly 
splitting one or more of the cells in $C_1$. This refinement may 
be worthwhile if there are empirical indications: (1) that the 
within-cell mean estimator $\hat{\bar{Y}}_{hr}$ may be substantially biased; 
or (2) that a cell is wide relative to the precision with which 
the $\eta_i$ values are estimated. Subsections 3.2.1 and 3.2.2 use 
two simple diagnostic methods to address issues (1) and (2), 
respectively. In each subsection, the proposed diagnostics 
lead to identification of potential “problem cells”, and to 
construction of a refined set of adjustment cells, $C_2$, say. 
Comparisons of estimates of $\bar{Y}$ based on $C_1$ and $C_2$ then 
lead to some conclusions regarding the adequacy of 
$\hat{\eta}_i$-based adjustment cells. 


3.2.1 Assessment of Within-Cell Bias 


As noted in Section 1.2, a given adjusted estimator $\hat{\bar{Y}}_k$ 
reduces, but may not entirely eliminate, nonresponse bias; and 
the residual bias of $\hat{\bar{Y}}_k$ depends on the biases of the 
within-cell mean estimates $\hat{\bar{Y}}_{hr}$. Consider the alternative within-cell 
mean estimator 

$$\hat{\bar{Y}}_{h\eta} = \Big(\sum_{i \in s_h} \lambda_i \hat{\eta}_i^{-1} R_i\Big)^{-1} \sum_{i \in s_h} \lambda_i \hat{\eta}_i^{-1} R_i Y_i. \qquad (3.1)$$

If the $\hat{\eta}_i$ estimates were equal to the true response 
probabilities $\eta_i$, then (3.1) would be an approximately 
unbiased estimator of the true subpopulation mean $\bar{Y}_h$. In that 
case, an estimator of the within-cell bias $E(\hat{\bar{Y}}_{hr} - \bar{Y}_h)$ would 
be $B_h = \hat{\bar{Y}}_{hr} - \hat{\bar{Y}}_{h\eta}$ and the corresponding estimator of the 
overall bias $E(\hat{\bar{Y}}_k - \bar{Y})$ would be 
$B = (\sum_{i \in s} \lambda_i)^{-1} \sum_{h=1}^{k} (\sum_{i \in s_h} \lambda_i) B_h$. 

Because the $\hat{\eta}_i$ values are subject to estimation error, the 
terms $B_h$ and $B$ give only a partial indication of potential bias 
problems. For example, a large value of $B_h$ may reflect a 
substantial bias in $\hat{\bar{Y}}_{hr}$, or may reflect biases in the alternative 
estimator (3.1) due to the errors $\hat{\eta}_i - \eta_i$; cf. the cautionary 
remarks in Little (1986, p. 146) regarding direct use of the 
weights $\hat{\eta}_i^{-1}$ in adjusted estimation of $\bar{Y}$. Thus, if one 
observes a large value of $B_h$, it is worthwhile to consider 
refinement of cell $h$; but the final decision of whether to use 
the resulting refined set of cells will depend on whether the 
refined set produces a substantially different estimate of the 
overall mean $\bar{Y}$. 
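
The within-cell diagnostic can be sketched in a few lines: for each cell, compare the weighted respondent mean with the alternative mean that additionally weights each respondent by the inverse of its estimated response probability, and then combine the cell differences using the cell weight shares. The function name and data are hypothetical, and the standard errors (obtained by pseudo-replication in the paper) are not computed here.

```python
def cell_bias_diagnostics(weights, responded, y, eta_hat, cell):
    """Return ({cell: B_h}, B) for the within-cell and overall bias diagnostics."""
    total_weight = sum(weights)
    b_by_cell = {}
    b_overall = 0.0
    for h in sorted(set(cell)):
        members = [i for i, c in enumerate(cell) if c == h]
        respondents = [i for i in members if responded[i] == 1]
        # Weighted respondent mean within the cell.
        mean_hr = (sum(weights[i] * y[i] for i in respondents)
                   / sum(weights[i] for i in respondents))
        # Alternative mean: respondents also weighted by 1 / eta_hat.
        alt_weights = [weights[i] / eta_hat[i] for i in respondents]
        mean_alt = (sum(a * y[i] for a, i in zip(alt_weights, respondents))
                    / sum(alt_weights))
        b_h = mean_hr - mean_alt
        b_by_cell[h] = b_h
        b_overall += sum(weights[i] for i in members) / total_weight * b_h
    return b_by_cell, b_overall


# Hypothetical data: cell 0 mixes units with different estimated response
# probabilities; cell 1 is homogeneous, so its diagnostic is zero.
weights = [1.0] * 6
responded = [1, 1, 0, 1, 1, 0]
y = [10.0, 20.0, 0.0, 30.0, 34.0, 0.0]   # 0.0 is a placeholder for nonrespondents
eta_hat = [0.5, 0.9, 0.6, 0.8, 0.8, 0.8]
cell = [0, 0, 0, 1, 1, 1]

b_by_cell, b_overall = cell_bias_diagnostics(weights, responded, y, eta_hat, cell)
print(b_by_cell, b_overall)
```

A cell whose respondents all share the same estimated probability always yields a zero difference, which is why the diagnostic points toward cells that remain heterogeneous.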

Tables 2 and 3 present $B_h$ values and associated standard 
errors and t statistics for equal-quantile-division cells with 
k = 3 and k = 5, respectively. Note that for the case k = 3, 
the $B_h$ diagnostics indicate a possible bias contribution from 
the lowest cell. This is consistent with the suggestion from 
Section 3.1 that k = 3 cells may not provide a satisfactory 
nonresponse adjustment. In addition, the corresponding value 
of B was 111, with a standard error of 75; this value of B is 
very close to the difference $\hat{\bar{Y}}_3 - \hat{\bar{Y}}_5 = 106$ of the estimates 
$\hat{\bar{Y}}_3$ and $\hat{\bar{Y}}_5$ from Table 1. 


Table 2 
Within-Cell $B_h$ Statistics for Probability-Based Cells, k = 3 

h     B_h    se(B_h)    t = B_h/se(B_h)
1     269      136          1.98
2     -19       43         -0.44
3      84       45          1.87


Table 3 
Within-Cell $B_h$ Statistics for Probability-Based Cells, k = 5 

h     B_h    se(B_h)    t = B_h/se(B_h)
1      96      [?]          0.44
2     -72      116         -0.62
3     -52       56         -0.93
4     -16       27         -0.59
5      98       50          1.96


In light of the preceding results, the low-$\hat{\eta}_i$ cell from the 
k = 3 case was split in half. The upper bounds for the two 
new cells (h = 1’ and h = 1”, say) were determined by the 
0.167 and 0.333 estimated quantiles of the $\hat{\eta}_i$ population. 
The resulting $B_h$ values and standard errors were 90 and 197 
for cell 1’, and -42 and 79 for cell 1”. In addition, the 
refined set of four cells had B = 30, with a standard error of 
75; and the adjusted estimate of $\bar{Y}$ equal to $32,652 and 
standard error of $518 were close to those obtained from the 
equal-quantile-division method with k = 5. 

In contrast with the results for k = 3, the $B_h$ results for k = 5 
indicated relatively little basis for concern, with the possible 
exception of cell h = 5, which had a t statistic of 1.96. For 
k = 5, the value of B was 11, with a standard error of 93. 
Additional splitting of cell h = 5 did not lead to notable 
changes in either the estimate of $\bar{Y}$ or the associated standard 
errors. The $B_h$ results for equal-quantile-division cells with 
larger values of k showed even fewer indications of 
within-cell bias. For example, for k = 6 all six cells had $B_h$ values 
with t statistics less than or equal to 1.65; and for k = 10, all 
cells had $B_h$ values with t statistics less than or equal to 1.54. 


3.2.2 Relation of Cell Widths to Precision of \(\hat{\pi}_i\)
Estimates


The relationship between the widths of adjustment cells
and the widths of confidence intervals for the response
probabilities \(\pi_i\) leads to another diagnostic for identification
of potential problem cells. First, define \(a_h\) as the ratio of the
weighted sum over all sample units in cell h to the weighted sum
over the responding units in cell h; this is the
nonresponse-adjustment factor used for
responding units in cell h. Second, following standard results
for logistic regression, note that an approximate 95%
confidence interval for \(\pi_i\) is

\[ (LB_i, UB_i) = \big( [1 + \exp\{-X_i'\hat{\theta} + 1.96 D_i^{1/2}\}]^{-1},\;
                      [1 + \exp\{-X_i'\hat{\theta} - 1.96 D_i^{1/2}\}]^{-1} \big), \]

where \(\hat{\theta}\) is the vector of logistic regression parameter
estimates, \(D_i = X_i' \hat{V}_\theta X_i\), and \(\hat{V}_\theta\) is the
pseudo-replicate-based estimated covariance matrix of \(\hat{\theta}\).
Let \(d_h\) be the weighted sample mean of the confidence interval
widths \(UB_i - LB_i\) for units i in cell h, and consider a
comparison of \(d_h\) to the width of cell h. If cell h is relatively
wide, both on an absolute scale and relative to \(d_h\), then division
of this cell may produce two new cells with two substantially
different weight factors \(a_h\).
Conversely, if \(d_h\) is substantially larger than the width of cell
h, then differences among the \(\hat{\pi}_i\) in that cell may result more
from estimation error than from differences in the true \(\pi_i\). In that
case, additional division of cell h is unlikely to produce much
useful change in weight factors \(a_h\); and thus there will be
relatively little change in the resulting nonresponse-adjusted
estimator of Y.
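
As a rough illustration of this diagnostic, the following Python sketch computes the approximate confidence limits for each unit and the weighted mean interval width for one cell. The function names, the toy design matrix, the parameter vector and the covariance matrix are all hypothetical; they are not taken from the survey data, and the covariance matrix is simply assumed to be available from a replication method.

```python
import numpy as np

def response_prob_ci(X, theta_hat, V_hat, z=1.96):
    """Approximate CI (lower, upper) for each unit's response probability."""
    eta = X @ theta_hat                           # linear predictor X_i' theta_hat
    D = np.einsum("ij,jk,ik->i", X, V_hat, X)     # D_i = X_i' V_hat X_i
    half = z * np.sqrt(D)
    lower = 1.0 / (1.0 + np.exp(-eta + half))
    upper = 1.0 / (1.0 + np.exp(-eta - half))
    return lower, upper

def mean_ci_width(lower, upper, weights, in_cell):
    """Weighted mean of interval widths over the units flagged by in_cell."""
    widths = (upper - lower)[in_cell]
    return float(np.sum(weights[in_cell] * widths) / np.sum(weights[in_cell]))

# Toy inputs (hypothetical values, for illustration only).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
theta_hat = np.array([0.0, 0.5])
V_hat = 0.04 * np.eye(2)
lower, upper = response_prob_ci(X, theta_hat, V_hat)
d_h = mean_ci_width(lower, upper, np.ones(3), np.array([True, True, False]))
```

The value `d_h` would then be compared with the width of the cell, that is, the difference between its upper and lower probability bounds.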

Tables 4 and 5 report cell boundaries, cell widths, \(d_h\), and \(a_h\)
values for k = 5 and k = 10, respectively. For k = 5, the
widths of cells 2 through 5 were not large relative to the \(d_h\)
values. Each of these cells is essentially split in half to
produce the k = 10 cell case. The resulting pairs of \(a_h\) for
k = 10 are relatively close to the corresponding \(a_h\) values in
cells 2 through 5 with k = 5.

By contrast, with k = 5, cell 1 is over twice as wide as \(d_1\).
When k = 10, this cell is divided into cells with somewhat
different nonresponse adjustment weight factors \(a_h\): 1.45 and
1.27, respectively. However, the corresponding cell-mean
estimates are relatively close: $24,045 and $24,582 for the
first and second cells with k = 10. Thus, in this example, the
nonresponse-adjusted estimates \(\hat{Y}_5\) and \(\hat{Y}_{10}\) are
relatively close because four of the five cell divisions produced
relatively small changes in weights, and because the other cell
division produced two cells with similar cell means.


Table 4
Estimated-Probability Cell Boundaries, Cell Widths, Mean
Confidence Interval Widths and Nonresponse Adjustment
Factors, k = 5

       Lower    Upper    Cell
 h     Bound    Bound    Width    d_h      a_h
 1     0.384    0.810    0.426    0.197    1.35
 2     0.810    0.861    0.051    0.139    1.20
 3     0.861    0.894    0.033    0.110    1.13
 4     0.894    0.924    0.030    0.088    1.08
 5     0.924    0.994    0.070    0.067    1.07


Finally, the \(a_h\) factors in Table 5 indicate that mean
response rates in the k = 10 cells fall in a moderate range,
from \((1.45)^{-1} = 0.69\) to \((1.06)^{-1} = 0.94\). Some other
nonresponse datasets involve a wider range, and thus are more
likely to produce more pronounced cell-splitting results.
Conversely, other nonresponse datasets may display a tighter
distribution of response probabilities, and thus are less likely
to display notable cell-splitting effects.


Table 5
Estimated-Probability Cell Boundaries, Cell Widths, Mean
Confidence Interval Widths and Nonresponse Adjustment
Factors, k = 10

       Lower    Upper    Cell
 h     Bound    Bound    Width    d_h      a_h
 1     0.384    0.762    0.378    0.220    1.45
 2     0.762    0.810    0.048    0.174    1.27
 3     0.810    0.840    0.030    0.146    1.21
 4     0.840    0.861    0.021    0.132    1.19
 5     0.861    0.878    0.017    0.111    1.14
 6     0.878    0.894    0.016    0.108    [illegible]
 7     0.894    0.908    0.014    0.093    1.09
 8     0.908    0.924    0.016    0.083    1.08
 9     0.924    0.944    0.020    0.072    1.08
10     0.944    0.994    0.050    0.062    1.06


3.3. Comparison of Cell-Based Estimates to the 
Unadjusted Estimate 


To conclude the assessment of \(\hat{\pi}_i\)-based cells, we
compared the adjusted estimates \(\hat{Y}_k\) with the unadjusted
estimate \(\hat{Y}_1\). First, Table 1 indicates that for the reported
values of k ≥ 5, the differences \(\hat{Y}_1 - \hat{Y}_k\) are greater
than or equal to $303. Second, for k ≥ 5, the estimated standard
errors of the differences \(\hat{Y}_1 - \hat{Y}_k\) are all less than or
equal to $138, and the corresponding t statistics are all greater than
2.44. Thus, for k = 5, say, a formal test of the hypothesis
\(H_0: E(\hat{Y}_1 - \hat{Y}_k) = 0\) would be rejected at standard
significance levels; i.e., the adjustment-cell method has produced a
significant change in the mean income estimate.
In addition, a rough comparison of the efficiencies of \(\hat{Y}_1\)
and \(\hat{Y}_k\) follows from the estimated mean squared error ratio

\[ \hat{\gamma}_k = \{\hat{V}(\hat{Y}_k)\}^{-1}
   \big[ \hat{V}(\hat{Y}_1) + \max\{0,\, (\hat{Y}_1 - \hat{Y}_k)^2
   - \hat{V}(\hat{Y}_1 - \hat{Y}_k)\} \big], \]

where \(\hat{V}(\hat{Y}_1)\), \(\hat{V}(\hat{Y}_k)\), and
\(\hat{V}(\hat{Y}_1 - \hat{Y}_k)\) are the pseudo-replicate-based
variance estimates for the indicated means.
To interpret this ratio, assume for the moment that \(\hat{Y}_k\) is an
approximately unbiased estimator of Y. Then \(\hat{\gamma}_k\) is an
estimator of the mean squared error of the unadjusted
estimator \(\hat{Y}_1\), relative to the mean squared error of \(\hat{Y}_k\).
Consequently, \(\hat{\gamma}_k\) reflects the loss of efficiency incurred by
using the biased, unadjusted estimator \(\hat{Y}_1\) instead of the
adjusted, unbiased estimator \(\hat{Y}_k\). However, this interpretation
should be viewed with some caution since it depends on the
assumption that \(\hat{Y}_k\) is approximately unbiased for Y, and
since the \(\hat{\gamma}_k\) are functions of the random terms
\(\hat{Y}_1 - \hat{Y}_k\), \(\hat{V}(\hat{Y}_1)\), \(\hat{V}(\hat{Y}_k)\),
and \(\hat{V}(\hat{Y}_1 - \hat{Y}_k)\).

As suggested by a referee, one could also consider a mean 
squared error ratio 


k? 


A 


(VY) PY) + max{0,(7,- ¥,)*- PY, - FH 


where 1g equals expression (1.1) with A, replaced by 
(A, ie ,. This would amount to comparing each cell-based 
estimate ie to Y_.. This is appropriate if va is approximately 
unbiased, but this unbiasedness may be problematic in some
cases; cf. Little (1986, p. 146). 

The final column of Table 1 reports the estimated ratios \(\hat{\gamma}_k\)
for specified values of k. For k ≥ 5, each reported \(\hat{\gamma}_k\) is
greater than 1.5. Finally, note that each adjusted estimate \(\hat{Y}_k\)
fell below the unadjusted estimate \(\hat{Y}_1\). This occurred
because, for a given k, cells associated with larger response
probabilities tended to have larger cell-mean estimates. For
example, for k = 5, the cell-mean estimates were $24,333, $33,729,
$33,398, $34,620, and $37,057 for h = 1 (the low-\(\hat{\pi}_i\) cell)
through h = 5 (the high-\(\hat{\pi}_i\) cell), respectively.


4. CELLS BASED ON ESTIMATED 
INCOME VALUES 


The general diagnostic ideas of Section 3 also apply to
\(\hat{Y}_i\)-based cells. To illustrate this idea, we fit separate weighted
regressions of \(Y_i\) = reported income for second- and
fifth-interview respondents. Yansaneh and Eltinge (1993)
report details of the work, including parameter estimates and
standard errors. The resulting regression models were used
to compute estimated incomes \(\hat{Y}_i\) for both complete and
incomplete income reporters. Units were then grouped into
cells according to their \(\hat{Y}_i\) values, with cell boundaries
determined by the equal-quantile method.

Table 6 reports the basic sensitivity-analysis and efficiency
results for the \(\hat{Y}_i\)-based cells; the organization of this table is
the same as in Table 1. The sensitivity-analysis results are
qualitatively similar, but not identical, to those reported for
the \(\hat{\pi}_i\)-based cells. In additional work not detailed here, we
considered splitting individual equal-quantile \(\hat{Y}_i\)-based cells.
For k ≥ 4, the resulting mean estimates and associated
standard errors did not differ notably from those reported in
Table 6.


Table 6
Adjusted Estimates of Mean Income with Cell Boundaries
Determined by Estimated Income Quantiles

                      \(\hat{Y}_k\)    \(se(\hat{Y}_k)\)   \(se(\hat{Y}_1 - \hat{Y}_k)\)   \(\hat{\gamma}_k\)
Unadjusted (k = 1)    32,967         569       N/A       N/A
k = 3 cells           [illegible]    509       106       2.01
k = 4 cells           32,468         512       108       2.14
k = 5 cells           32,473         511       115       2.12
k = 6 cells           32,492         508       117       2.08
k = 10 cells          32,488         510       119       2.07
k = 15 cells          32,478         504       124       2.16
k = 20 cells          32,495         513       124       2.02


The final two columns of Table 6 permit comparison of \(\hat{Y}_k\)
to the unadjusted estimate \(\hat{Y}_1\). For k ≥ 4, the differences
\(\hat{Y}_1 - \hat{Y}_k\) are greater than or equal to $472, with estimated
standard errors less than or equal to $124. The associated t
statistics are all greater than 3.80. In addition, the estimated
mean squared error ratios \(\hat{\gamma}_k\) are all greater than 2.0.
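
The mean squared error ratio of Section 3.3 can be checked against the k = 5 row of Table 6 with a few lines of Python. This is only a sketch: the function name is hypothetical, and the formula is the one read from Section 3.3, namely the estimated mean squared error of the unadjusted estimate divided by the variance estimate of the adjusted one.

```python
def mse_ratio(se_adj, se_unadj, diff, se_diff):
    """Estimated MSE of the unadjusted estimate relative to the adjusted one."""
    # Squared-bias estimate, truncated at zero as in the max{0, .} term.
    bias_sq = max(0.0, diff ** 2 - se_diff ** 2)
    return (se_unadj ** 2 + bias_sq) / se_adj ** 2

# Table 6 inputs for k = 5: adjusted estimate 32,473 (se 511),
# unadjusted estimate 32,967 (se 569), se of the difference 115.
gamma_5 = mse_ratio(se_adj=511, se_unadj=569, diff=32967 - 32473, se_diff=115)
# gamma_5 is roughly 2.12
```

The same call with the other rows of Table 6 gives ratios slightly above 2 in each case.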

Also, the \(\hat{\pi}_i\)- and \(\hat{Y}_i\)-based cells produced somewhat
different adjusted estimates of mean income, but the observed
differences were not statistically significant at customary α
levels. For example, with k = 5, the difference between the
\(\hat{\pi}_i\)- and \(\hat{Y}_i\)-based cell estimates is
$32,630 - $32,473 = $157, with a standard error of $122 and a
t statistic of 1.29. Similarly, for k = 10, the difference between the
\(\hat{\pi}_i\)- and \(\hat{Y}_i\)-based estimates is $152, with a standard
error of $104. Thus, the data provide relatively little power to
distinguish between results of the two general cell-formation methods.

Finally, note that a given set of \(\hat{Y}_i\)-based cells is
fundamentally linked with a particular Y variable, e.g.,
consumer unit income. Consequently, that set of cells will
not necessarily work well for estimation of the mean of a
different Y variable.


5. DISCUSSION


5.1 Summary of Methods 


This paper has discussed some simple diagnostics for 
formation of nonresponse adjustment cells. The methodology 
may be summarized as follows. 


1. Based on preliminary modeling work and observed
auxiliary variables \(X_i\), compute an estimated response
probability \(\hat{\pi}_i\) for each sample unit (respondents and
nonrespondents).


2. Construct k adjustment cells with boundaries determined
by the estimated j/k quantiles of the \(\hat{\pi}_i\) population,
j = 1, 2, ..., k - 1. Compute the resulting adjusted mean
estimate, \(\hat{Y}_k\).


3. Repeat (2) for several integers k > 1. As k increases,
identify the point at which the \(\hat{Y}_k\) become approximately
constant. In keeping with Rosenbaum and Rubin (1984)
and the empirical results discussed here, values of k near
5 may be of special interest.


4. Use simple screening diagnostics (e.g., \(B_h\) and \(d_h\) in
Section 3.2) to check for potential problems in the
equal-quantile-division adjustment cells. If the diagnostics
identify potential “problem cells,” then try
additional refinements of these cells. Compute estimates
of Y based on these refined sets of cells, and compare
these new estimates to the \(\hat{Y}_k\) from (3).


5. Assess the overall effect of adjustment by comparing the
differences \(\hat{Y}_1 - \hat{Y}_k\) to the standard errors
\(se(\hat{Y}_1 - \hat{Y}_k)\); and by computing the estimated mean
squared error ratios \(\hat{\gamma}_k\).


6. Repeat steps (1) through (5), as appropriate, for \(\hat{Y}_i\)-based
adjustment cells. Compare the final estimates of Y
obtained from the \(\hat{\pi}_i\)- and \(\hat{Y}_i\)-based cell methods.
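
Steps 2 and 3 above can be sketched in Python as follows. This is a simplified illustration rather than the procedure used in the paper: the function and argument names are hypothetical, the cell boundaries use unweighted quantiles of the estimated probabilities, and the adjustment factor for a cell is taken to be the weighted count of all units in the cell divided by the weighted count of its respondents.

```python
import numpy as np

def adjusted_mean(y, responded, p_hat, weights, k):
    """Adjusted mean with k equal-quantile cells of estimated probabilities."""
    # Boundaries at the j/k quantiles, j = 1, ..., k - 1 (unweighted here).
    cuts = np.quantile(p_hat, np.arange(1, k) / k)
    cell = np.searchsorted(cuts, p_hat, side="right")   # cell index 0..k-1
    total, weight_sum = 0.0, 0.0
    for h in range(k):
        in_cell = cell == h
        if not in_cell.any():
            continue
        resp = in_cell & responded
        if not resp.any():
            raise ValueError(f"cell {h} has no respondents")
        a_h = weights[in_cell].sum() / weights[resp].sum()  # adjustment factor
        total += a_h * np.sum(weights[resp] * y[resp])
        weight_sum += a_h * weights[resp].sum()
    return total / weight_sum

# Toy data (hypothetical): one nonrespondent in the low-probability cell.
y = np.array([10.0, np.nan, 20.0, 20.0])
responded = np.array([True, False, True, True])
p_hat = np.array([0.5, 0.5, 0.9, 0.9])
est = adjusted_mean(y, responded, p_hat, np.ones(4), k=2)
```

Calling `adjusted_mean` for several values of k and tabulating the results corresponds to the sensitivity analysis of step 3.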


5.2 Areas for Future Research 


The results of this work suggest two potentially useful
areas for future research. First, the CE income nonresponse
problem is similar to nonresponse problems in some other
large-scale surveys, but as with any case study one should not
over-generalize the empirical results reported here. It would
be useful to apply these diagnostics to problems involving
different estimands (e.g., cross-class means) or involving
nonresponse datasets with somewhat different characteristics,
e.g., larger or smaller effective sample sizes; or wider or
narrower distributions of \(\hat{\pi}_i\) estimates. This in turn would
offer additional insight into the operating characteristics of
\(\hat{\pi}_i\)- and \(\hat{Y}_i\)-based adjustment cell methods in practical
applications. Second, extensions to multivariate problems
(e.g., relationships involving second-interview and
fifth-interview CE income data) also would be of interest.


Variance Estimation for Measures of Income Inequality and
Polarization — An Empirical Study 


MILORAD S. KOVAČEVIĆ and WESLEY YUNG¹


ABSTRACT 


Measures of income inequality and polarization are fundamental to the discussions of many economic and social issues. 
Most of these measures are non-linear functions of the distribution function and/or the quantiles and thus their variances 
are not expressible by simple formulae and one must rely on approximate variance estimation techniques. In this paper, 
several methods of variance estimation for six particular income inequality and polarization measures are summarized and 
their performance is investigated empirically through a simulation study based on the Canadian Survey of Consumer 
Finance. Our findings indicate that for the measures studied here, the bootstrap and the estimating equations approach
perform considerably better than the other methods.


KEY WORDS: Gini index; Lorenz curve ordinate; Low income proportion; Polarization index; Quantile share; 
Resampling variance estimation; Linearization method. 


1. INTRODUCTION 


Analyses of the distribution of income are fundamental to 
the discussions of important economic and social issues such 
as the extent of inequality, poverty, the size of the middle 
class, etc. There exists extensive statistical and econometric 
literature on this subject, especially on different measures of 
income inequality and their properties (Sen 1973, Kakwani
1980, Nygård and Sandström 1981). However, seldom is
there any attempt to produce information regarding the
sampling variability associated with the estimates used to
assess the magnitude of inequality or polarization. Such
information is necessary for two reasons: i) as a measure of
the precision of the estimates obtained from survey data and
ii) to provide a basis for formal statistical inference on income
distributions, particularly when income distributions are
compared over different regions or across time.

Measures of income inequality and polarization are finite
population parameters expressible as functions of the ordered
population values; thus their variances are not obtainable in
simple formulae and one has to rely on approximate variance
estimation techniques. Generally, inference about these
measures, based on a complex sample design, embodies point
estimation and confidence intervals. We investigate variance
estimation for some of these measures such as quantiles, low
income line, low income proportion, Lorenz curve ordinates,
quantile shares, Gini index, and the polarization index.
Throughout this paper we assume a fixed finite population
framework, that is, we assume that associated with each
population unit is a fixed but unknown real number: the value
of income earned by the unit. We assume that the population
is stratified into L strata with \(N_h\) primary sampling units
(PSU’s) in the h-th stratum. In the first stage sample, \(n_h\) (≥ 2)
PSU’s are selected from stratum h (independently across
strata). We assume that subsampling within sampled PSU’s
is performed to ensure unbiased estimation of PSU totals,
\(Y_{hc}\), c = 1,...,\(n_h\); h = 1,...,L. Attached to the (hci)-th ultimate
unit, along with the observed variable of interest, \(y_{hci}\), is the
sampling weight \(w_{hci}\). We use \(\sum_s = \sum_h \sum_c \sum_i\) to denote
summation over all ultimate units in the sample, incorporating
all stages of sampling.

After reviewing the basic definitions of these measures, we 
give their point estimates under our sample design in section 2. 
Section 3 deals with variance estimation of these measures. 
Existing methods are reviewed and five methods, jackknifing, 
grouped and repeatedly grouped balanced half-sample, 
bootstrap and linearization via the estimating equations 
approach are summarized in detail. Section 4 contains the 
description of the simulation study based on data collected in 
the 1988 Canadian Survey of Consumer Finance. The empirical
study is aimed at comparisons of the variance estimation
methods for a number of income inequality measures.
Various results are presented, summarized and interpreted. 
Our conclusions are presented in section 5. 


2. ESTIMATION OF INCOME INEQUALITY 
MEASURES 


The simplest measures of inequality between two 
distributions are the cumulative distribution function (CDF) 
and the quantiles of the two distributions. We start this 
section by defining the CDF and the finite population 
quantiles. The remaining measures studied in this paper are 
functions of the CDF or a fixed number of quantiles and are 
introduced in section 2.1. 

For a variable Y defined over a finite population
U = {1,...,N}, we define the CDF as

\[ F_N(y) = \sum_{i \in U} I\{Y_i \le y\} \frac{1}{N}, \]

where I{a} is an indicator function taking on a value of 1 if a
is true and 0 otherwise. A design unbiased estimator of
\(F_N(y)\) is

\[ \tilde{F}(y) = \sum_{i \in s} I\{y_i \le y\} \frac{w_i}{N}, \]

where the sampling weights, \(w_i\), are obtained from the
sample design and are equal to the inverse of the first order
inclusion probabilities. This estimator may not be a CDF
since \(\tilde{F}(\infty) = \hat{N}/N\) may not necessarily be equal to 1.
Thus we would rather use the possibly design-biased estimator:

\[ \hat{F}(y) = \sum_{i \in s} I\{y_i \le y\} w_i \Big/ \sum_{i \in s} w_i
             = \sum_{i \in s} I\{y_i \le y\} \tilde{w}_i, \qquad (2.1) \]

where \(\tilde{w}_i = w_i / \sum_s w_i\), i ∈ s. The estimator (2.1) is design
unbiased when \(\sum_s w_i = N\), which can occur under simple
random sampling or if the weights, \(w_i\), are benchmarked to
known population totals. In general, the estimator (2.1) uses
final weights which usually involve poststratification,
nonresponse adjustment, some iterative calibrations and so on. In
this paper, we consider only the case where the design
weights are used.

Turning to the quantiles, we define the finite population
quantiles as

\[ \xi_N(p) = \inf_{i \in U} \{Y_i \mid F_i \ge p\} \quad \text{for } 0 < p \le 1, \]

where \(F_i = F_N(Y_i)\). The population quantiles are estimated by
the sample quantiles

\[ \hat{\xi}_p = \inf_{i \in s} \{y_i \mid \hat{F}_i \ge p\} \quad \text{for } 0 < p \le 1, \]

where \(\hat{F}_i = \hat{F}(y_i)\). If a parameter is a function of quantiles,
say \(\theta_N = g(\xi_N)\) with \(\xi_N = (\xi_N(p_1), \ldots, \xi_N(p_k))'\), then it is
estimated by \(\hat{\theta} = g(\hat{\xi})\) where
\(\hat{\xi} = (\hat{\xi}_{p_1}, \ldots, \hat{\xi}_{p_k})'\).
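
The weighted CDF estimator with standardized weights and the corresponding sample quantile can be sketched in Python as below. The function names and the toy data are hypothetical; the quantile function returns the smallest observed value whose estimated CDF is at least p, in line with the definition above.

```python
import numpy as np

def weighted_cdf(y, w, t):
    """Estimated CDF at t using weights standardized to sum to one."""
    w_std = w / np.sum(w)
    return float(np.sum(w_std[y <= t]))

def weighted_quantile(y, w, p):
    """Smallest observed y whose estimated CDF value is at least p."""
    order = np.argsort(y)
    cum = np.cumsum(w[order]) / np.sum(w)
    # First position where the cumulative standardized weight reaches p;
    # the clamp guards against round-off when p is 1.
    idx = min(np.searchsorted(cum, p, side="left"), len(y) - 1)
    return float(y[order][idx])

y = np.array([10.0, 20.0, 30.0, 40.0])
w = np.ones(4)
median = weighted_quantile(y, w, 0.5)
cdf_25 = weighted_cdf(y, w, 25.0)
```

With unequal weights the same functions apply unchanged, since only the standardized weights enter the computation.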


2.1 Income Inequality and Polarization Measures as 
Finite Population Parameters 


In this section we present some frequently used income 
inequality and polarization measures. They are the low 
income line, the low income proportion, the Lorenz Curve 
and its related statistics, the quantile shares, the Gini index 
and finally the polarization curve and the polarization index. 
Our intention is to briefly introduce these measures, not to 
discuss them in detail. For more details, we refer the readers 
to Nygård and Sandström (1981) and Wolfson (1994).

The low income line, or the poverty line, is defined as a
fraction of the median, \(\Lambda_N = \alpha \xi_N(0.5)\), where 0 < α < 1 is a
given constant and \(\xi_N(0.5)\) is the finite population median. Its
estimate is simply \(\hat{\Lambda} = \alpha \hat{\xi}_{0.5}\).

The low income proportion (LIP) is the percentage of units
(individuals, families, households) in the population falling
below the low income line \(\Lambda_N\) and is given by
\(\Delta_N = F_N(\Lambda_N)\). The estimate of the low income proportion
involves the estimation of both the distribution function and the low
income line, \(\hat{\Delta} = \hat{F}(\hat{\Lambda})
= \sum_s I\{y_{hci} \le \hat{\Lambda}\} \tilde{w}_{hci}\).
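
A short Python sketch of the low income line and the low income proportion follows. The names and the toy incomes are hypothetical, and the fraction of the median is set to one half only for illustration.

```python
import numpy as np

def weighted_median(y, w):
    """Smallest observed y whose cumulative standardized weight reaches 0.5."""
    order = np.argsort(y)
    cum = np.cumsum(w[order]) / np.sum(w)
    idx = min(np.searchsorted(cum, 0.5, side="left"), len(y) - 1)
    return y[order][idx]

def low_income_proportion(y, w, alpha=0.5):
    """Return (low income line, weighted share of units at or below it)."""
    line = alpha * weighted_median(y, w)
    lip = np.sum(w[y <= line]) / np.sum(w)
    return float(line), float(lip)

y = np.array([10.0, 20.0, 30.0, 40.0, 100.0])
w = np.ones(5)
line, lip = low_income_proportion(y, w)   # median 30, line 15
```

Because the line is itself estimated from the sample, the proportion depends on the data twice, which is what makes its variance awkward to estimate.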

The finite population Lorenz curve ordinate (LCO) gives
the share of income received by the poorest 100p percent of
the population and is defined as a function of p (0 ≤ p ≤ 1). It
simply depicts the cumulative income against the population
share. As a parameter it is defined as

\[ L_N(p) = \frac{1}{\mu_N} \int_0^p \xi_N(q)\, dq, \]

where \(\mu_N\) is the population mean, and \(\xi_N(\cdot)\) is the quantile
function. For a large population without ties the expression
above is approximated by

\[ L_N(p) = \frac{1}{\mu_N} \sum_{i \in U} I\{F_i \le p\}\, Y_i \frac{1}{N} \]

and estimated as

\[ \hat{L}(p) = \frac{1}{\hat{\mu}} \sum_s I\{\hat{F}_{hci} \le p\}\, y_{hci}\, \tilde{w}_{hci}, \]

where \(\hat{\mu} = \sum_s \tilde{w}_{hci}\, y_{hci}\) and \(\hat{F}_{hci} = \hat{F}(y_{hci})\).

The quantile share (QS) is defined as the proportion of
total income shared by the population allocated to a quantile
interval \((\xi_N(p_1), \xi_N(p_2)]\):

\[ Q_N(p_1, p_2) = \frac{1}{\mu_N} \int_{p_1}^{p_2} \xi_N(q)\, dq = L_N(p_2) - L_N(p_1). \]

For 0 ≤ p_1 < p_2 ≤ 1 it is estimated by replacing the parameters
with their estimates.
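
The Lorenz curve ordinate and the quantile share can be sketched in Python as follows. Function names and data are hypothetical. Units are included when their estimated CDF value does not exceed p, and a small tolerance is added only to absorb floating-point round-off in the cumulative weights.

```python
import numpy as np

def lorenz_ordinate(y, w, p, tol=1e-12):
    """Share of total income held by units whose estimated CDF is at most p."""
    order = np.argsort(y)
    ys, ws = y[order], w[order] / np.sum(w)
    cum = np.cumsum(ws)
    # Estimated CDF at each y: cumulative weight up to the last tied unit.
    F = cum[np.searchsorted(ys, ys, side="right") - 1]
    mu = np.sum(ws * ys)
    keep = F <= p + tol
    return float(np.sum(ws[keep] * ys[keep]) / mu)

def quantile_share(y, w, p1, p2):
    """Income share of the units between the p1 and p2 quantiles."""
    return lorenz_ordinate(y, w, p2) - lorenz_ordinate(y, w, p1)

y = np.array([10.0, 20.0, 30.0, 40.0, 100.0])
w = np.ones(5)
l_40 = lorenz_ordinate(y, w, 0.4)        # (10 + 20) / 200
qs_mid = quantile_share(y, w, 0.4, 0.8)  # (30 + 40) / 200
```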

The most popular measure of aggregate inequality of
income distribution, the Gini index, is defined as the area
between the Lorenz curve and the 45° line, normalized to lie
between 0 and 1: \(G = 1 - 2\int_0^1 L(p)\, dp\). Its finite population
version is estimated by

\[ \hat{G} = \frac{1}{\hat{\mu}} \sum_s (2\hat{F}_{hci} - 1)\, y_{hci}\, \tilde{w}_{hci}. \]

For more about the Gini index we refer the reader to Nygård
and Sandström (1985).
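
As an illustration of the area definition only, the following Python sketch evaluates one minus twice the area under a piecewise-linear weighted Lorenz curve. It is not the estimator used in the paper, and the function name and data are hypothetical; in small samples the two computations can differ noticeably.

```python
import numpy as np

def gini_from_lorenz(y, w):
    """Gini index as 1 - 2 * (trapezoid area under the weighted Lorenz curve)."""
    order = np.argsort(y)
    ys, ws = y[order], w[order] / np.sum(w)
    p = np.concatenate(([0.0], np.cumsum(ws)))           # population shares
    L = np.concatenate(([0.0], np.cumsum(ws * ys))) / np.sum(ws * ys)
    area = np.sum((p[1:] - p[:-1]) * (L[1:] + L[:-1]) / 2.0)
    return float(1.0 - 2.0 * area)

g_unequal = gini_from_lorenz(np.array([1.0, 2.0, 3.0, 4.0]), np.ones(4))
g_equal = gini_from_lorenz(np.array([5.0, 5.0, 5.0, 5.0]), np.ones(4))
```

With equal incomes the Lorenz curve coincides with the 45° line and the index is zero, which gives a quick sanity check.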

Using the analogy of the Lorenz curve and the Gini index,
Foster and Wolfson (1992) defined the polarization curve as

\[ B(p) = \frac{1}{\xi_{0.5}} \int_{0.5}^{p} (\xi_q - \xi_{0.5})\, dq, \]

or in the finite population form

\[ B_N(p) = \begin{cases}
0.5 - p - \dfrac{1}{\xi_N(0.5)} \sum_{i \in U} I\{p < F_i \le 0.5\}\, Y_i \dfrac{1}{N}, & 0 \le p < 0.5, \\[2ex]
0.5 - p + \dfrac{1}{\xi_N(0.5)} \sum_{i \in U} I\{0.5 < F_i \le p\}\, Y_i \dfrac{1}{N}, & 0.5 \le p \le 1.
\end{cases} \]

The polarization curve shows, for any population percentile,
how far its income is from the median. The area below the
polarization curve is considered as a summary measure of the
polarization. A version of it, normalized to lie between 0 and
1, is named the polarization index (PI):


DD We SOD) peed AL 
pay UE 05 ORI 


U Ev(o.s) N 


where \(\xi_N(0.5)\), \(\mu_N\) and \(F_i\) were previously defined. The
estimate of the polarization index is obtained by replacing the
parameters with their estimates.


3. VARIANCE ESTIMATION 


The estimation of the variance of non-smooth statistics like 
quantiles, as well as quantile based functions like the low 
income proportion or the polarization index, is not straight- 
forward especially when the assumption of simple random 
sampling is untenable and there is a need to take into account 
the complex sample design. In the first part of this section we 
review some results on variance estimation for quantiles as a 
starting point for understanding the complexity of variance 
estimation for income inequality measures. We also review 
results on variance estimation for some measures like the 
Lorenz curve ordinates. The second part describes the 
methods of variance estimation that are used in this study. 

Woodruff (1952) proposed a method to obtain confidence 
intervals for individual quantiles. These intervals were used 
by Francisco and Fuller (1986) and Rao and Wu (1987) to 
derive variance estimators. Though the estimator depends on 
the confidence coefficient, Rao and Wu (1987) established its
asymptotic consistency for any significance level α. Using
Monte Carlo simulations, they studied the standard errors of 
quantiles for cluster samples estimated in this manner. Their 
results suggest that a 95% confidence interval works well as 
a basis for extracting the standard error. Binder (1991) 
obtained a similar form of the variance estimator by using the 
linearization method. 

Jackknife variance estimators have become extremely
popular for smooth functions of totals and means with the
increase in computing power. Standard asymptotic theory
applied to the median of a distribution with bounded
continuous density, f, shows that
\(nE(\hat{\xi}_{0.5} - \xi_{0.5})^2 \to 1/[4f^2(\xi_{0.5})]\)
as n → ∞. Efron (1979) pointed out that the jackknife method
applied to the sample median gives a variance estimate which
is asymptotically inconsistent since

\[ n\, v_J(\hat{\xi}_{0.5}) \to \frac{1}{4f^2(\xi_{0.5})} \left[ \chi_2^2/2 \right]^2 \]

in distribution, where \([\chi_2^2/2]^2\) has mean of 2 and variance of 20,
which means that the jackknife variance estimator tends to
overestimate, on the average, the correct asymptotic variance by 100%.
Kovar (1987) confirmed empirically the inconsistency of the
delete-one-unit jackknife estimators for a stratified sample
design. In a simulation study using a stratified population, he
showed that the delete-one-unit jackknife estimators (he
considered six of them) performed poorly, overestimating the
true variance by 30-70% in the design with two units per
stratum and performed even worse in the five units per
stratum design. Shao and Wu (1989), however, have shown
that under certain conditions, the delete-d jackknife method
has desirable asymptotic properties for variance estimation of
non-smooth statistics. This result has motivated Rao, Wu and
Yue (1992) to apply the delete-one-PSU jackknife for
stratified multistage sampling. In a limited simulation study
they found that both bias and relative bias of the jackknife
variance estimator of the median decrease as the cluster size
increases for a fixed intracluster correlation.
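
The mean of 2 and variance of 20 quoted above can be verified directly. A chi-square variable with two degrees of freedom, divided by two, has a standard exponential distribution, whose k-th moment is k!. The short Python check below uses this fact; the variable names are hypothetical.

```python
from math import factorial

def exp1_moment(k):
    """k-th raw moment of a standard exponential variable, E[X^k] = k!."""
    return factorial(k)

# X = chi-square(2) / 2 is standard exponential; the limit involves X^2.
mean_ratio = exp1_moment(2)                        # E[X^2]
var_ratio = exp1_moment(4) - exp1_moment(2) ** 2   # Var[X^2] = E[X^4] - (E[X^2])^2
overestimate_pct = 100 * (mean_ratio - 1)          # average overestimate, percent
```

A limiting mean of 2 relative to the correct value of 1 is what produces the 100% average overestimate.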

Bootstrap variance estimation for the median was first
reported by Efron (1979), and in the case of independent and
identically distributed observations the bootstrap provides
consistent results (see also Babu 1986). Rao and Wu (1988)
gave a modified bootstrap method for variance estimation in
stratified designs. Kovar (1987) and Kovar, Rao and Wu
(1988) reported good performance for medians when the size
of the bootstrap sample is \(n_h^* = n_h - 1\).

In the grouped balanced half-sample method (GBHS) of
variance estimation, the sampled clusters in each stratum are
randomly divided into two groups (halves) and the balanced
repeated replication method is applied to the groups. Rao and
Shao (1996) showed that this method is asymptotically
incorrect in the sense that the associated t-pivotal does not
converge in distribution to a standard normal distribution and
that the associated confidence intervals are asymptotically
incorrect. To overcome this difficulty they proposed
independently repeating the grouping T times and then taking the
average of the resulting T variance estimates (the repeatedly
grouped balanced half-sample method, RGBHS). They showed
the asymptotic correctness of such an estimator for a stratified
random sampling design as \(\min n_h \to \infty\) and \(T \to \infty\). In a small
simulation study they found that the method performs well for
T as small as 15 in the case of smooth estimators. For a
variance estimator of the population median, the RGBHS
method performed better than the jackknife and GBHS in the
sense that the RGBHS had a smaller relative bias and a
smaller coefficient of variation. Recently, McCarthy (1993)
discussed and compared a variety of procedures for variance
estimation of the median based on simple random samples
drawn from a finite population without replacement. His
study includes most resampling procedures.

Although linearization methods useful for nonlinear
statistics are difficult to implement for quantiles, since density
estimation is involved, Binder (1991), Binder and Kovačević
(1995) and Kovačević and Binder (1997) obtained consistent
estimators for the variance of some non-smooth measures of
income inequality and polarization using the linearization
method within the estimating equation framework.
Estimators obtained using this method are computationally
simpler than the resampling estimators but require theoretical
derivation.

Variance estimation of the Gini index has been studied by
several authors under the assumption of simple random
sampling: Glasser (1962), Sendler (1979), Sandström,
Wretman and Waldén (1985) and Yitzhaki (1991). In the
case of a complex design, Love and Wolfson (1976) proposed
a ‘crude half-sample replication’ method. Sandström,
Wretman and Waldén (1988) compared approximate variance
techniques with the delete-one-unit jackknife for three
sampling designs, two of which were complex.

Estimation of the variance of the Lorenz curve ordinates 
and the corresponding quantile shares has received less 
attention. The derivation of their asymptotic variances is quite 
complicated. There is the pioneering work of Beach and 
Davidson (1983) and Beach and Kaliski (1986). Their work 
is based on the superpopulation framework in which the 
survey weights are seen as constants in the construction of 
estimates. This approach, due to its model-based nature, may 
have its limitations in applications to data obtained from 
sample surveys where the sample design is deemed to be 
significant. 

In the following subsections we review the variance 
estimation methods used in this study. 


3.1 Delete-one-PSU Jackknife 


This method is based on the sequential exclusion (deletion)
of one PSU at a time from the computation of the estimate.
After deletion, the weights of the remaining units in the
sample are modified in such a manner that the deleted weights
are compensated and that the CDF estimated from the
remaining sample has the same properties as the original
CDF. Let \(\hat{F}_{(gj)}(y)\) denote the estimate of the CDF based on
a sample without the gj-th PSU, that is

\[ \hat{F}_{(gj)}(y) = \hat{G}_{(gj)}(y) / \hat{N}_{(gj)}, \]

where

\[ \hat{G}_{(gj)}(y) = \sum_{h \ne g} \sum_c \sum_i w_{hci}\, I\{y_{hci} \le y\}
   + \frac{n_g}{n_g - 1} \sum_{c \ne j} \sum_i w_{gci}\, I\{y_{gci} \le y\},
   \qquad \hat{N}_{(gj)} = \hat{G}_{(gj)}(\infty). \]

The ‘delete-one-PSU’ jackknife variance estimator of \(\hat{F}(y)\) is

\[ v_{J1}(\hat{F}(y)) = \sum_{g=1}^{L} \frac{n_g - 1}{n_g}
   \sum_{j=1}^{n_g} \big( \hat{F}_{(gj)}(y) - \hat{F}(y) \big)^2. \]

Asymptotic consistency of \(v_{J1}(\hat{F}(y))\) can be established using
results from Krewski and Rao (1981).

For convenience, we note that all measures considered
here can be written in the general form

\[ \theta_N = \sum_{i \in U} J(F_i, Y_i, \beta) \frac{1}{N}, \]

where J(·) is a real-valued function possibly dependent on the
nuisance parameter, β. The finite population parameter \(\theta_N\)
is then estimated by

\[ \hat{\theta} = \sum_s J(\hat{F}_{hci}, y_{hci}, \hat{\beta})\, \tilde{w}_{hci}, \qquad (3.1) \]

where \(\hat{\beta}\) denotes the estimated vector of nuisance parameters
and \(\tilde{w}_{hci}\) are the standardized weights. Using this general
form, the estimate of an income inequality measure computed
from the sample after omitting PSU gj is

\[ \hat{\theta}_{(gj)} = \sum_s J(\hat{F}_{hci(gj)}, y_{hci}, \hat{\beta}_{(gj)})\, \tilde{w}_{hci(gj)}, \]

where \(\hat{F}_{hci(gj)}\) and \(\hat{\beta}_{(gj)}\) are the values of the distribution
function and the nuisance parameter estimated from the
sample with the gj-th PSU deleted and

\[ \tilde{w}_{hci(gj)} = \begin{cases}
w_{hci} / \hat{N}_{(gj)}, & \text{if } h \ne g, \\[1ex]
\dfrac{n_g}{n_g - 1}\, w_{hci} / \hat{N}_{(gj)}, & \text{if } h = g,\ c \ne j, \\[1ex]
0, & \text{if } h = g,\ c = j.
\end{cases} \]

The resulting ‘delete-one-PSU’ jackknife variance estimator
of \(\hat{\theta}\) is

\[ v_{J1}(\hat{\theta}) = \sum_{g=1}^{L} \frac{n_g - 1}{n_g}
   \sum_{j=1}^{n_g} (\hat{\theta}_{(gj)} - \hat{\theta})^2. \qquad (3.2) \]

If \(\hat{\theta}\) is substituted by \(\bar{\theta} = \sum_g \sum_j \hat{\theta}_{(gj)} / n\), a variant of the
jackknife variance estimate is obtained. We denote it by
\(v_{J2}(\hat{\theta})\). Obviously \(v_{J2}(\hat{\theta}) \le v_{J1}(\hat{\theta})\). The consistency of (3.2)
for smooth statistics has been established by Krewski and Rao
(1981).

In the case of variance estimation for quantiles and
functions of quantiles, we first compute the quantiles based
on the sample with the gj-th PSU deleted,

\[ \hat{\xi}_{(gj)}(p) = \inf \{ y_{hci} \mid \hat{F}_{(gj)}(y_{hci}) \ge p,\ hci \in s \setminus (gj) \}, \]

compute \(\hat{\theta}_{(gj)} = g(\hat{\xi}_{(gj)})\) and then use equation (3.2) to obtain
a jackknife variance estimator.
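
A minimal Python sketch of the delete-one-PSU jackknife follows. It assumes that the estimator passed in standardizes the weights itself (as a weighted mean or weighted quantile does), so only the reweighting within the affected stratum needs to be coded. All names and the toy data are hypothetical, and every stratum is assumed to contain at least two PSUs.

```python
import numpy as np

def jackknife_variance(y, w, stratum, psu, estimator):
    """Delete-one-PSU jackknife variance for estimator(y, weights)."""
    theta_full = estimator(y, w)
    v = 0.0
    for g in np.unique(stratum):
        in_g = stratum == g
        psus = np.unique(psu[in_g])
        n_g = len(psus)                      # assumed to be at least 2
        for j in psus:
            w_rep = w.astype(float).copy()
            w_rep[in_g] *= n_g / (n_g - 1)   # inflate the rest of stratum g
            w_rep[in_g & (psu == j)] = 0.0   # drop the deleted PSU
            v += (n_g - 1) / n_g * (estimator(y, w_rep) - theta_full) ** 2
    return v

def weighted_mean(y, w):
    return np.sum(w * y) / np.sum(w)

# Toy data: one stratum, four single-unit PSUs with equal weights.
y = np.array([1.0, 2.0, 3.0, 4.0])
v_mean = jackknife_variance(y, np.ones(4), np.zeros(4, dtype=int),
                            np.arange(4), weighted_mean)
```

In this toy case the result equals the familiar sample variance divided by the number of PSUs, which gives a simple check for a linear statistic. For a quantile-based measure one would pass a function that recomputes the quantiles from the modified weights.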


3.2 Grouped Balanced Half-Sample (GBHS) Method 
and Repeatedly Grouped Balanced Half-Sample 
(RGBHS) Method 


Originally, the balanced half-sample method was proposed
for the two clusters-per-stratum designs. The case that we are
interested in is when there are more than 2 clusters per
stratum. This situation is usually handled by grouping the
clusters (primary stage units) in each stratum into two groups.
We explore the idea given by Wu (1991) and simplify its
application for the variance estimation of the CDF. First, in
each stratum h (h = 1,...,L), the PSU’s are grouped at random
into two halves, \(h_1\) and \(h_2\), containing \(m_{h1} = [n_h/2]\) and
\(m_{h2} = n_h - m_{h1}\) PSU’s, respectively. Setting the group
indicator to

\[ \delta_h^{(r)} = \begin{cases}
\phantom{-}1, & \text{if group } h_1 \text{ is in the } r\text{-th half-sample}, \\
-1, & \text{if group } h_2 \text{ is in the } r\text{-th half-sample},
\end{cases} \]

where r = 1,...,R denotes a half-sample (replicate), the
half-samples are balanced on the groups if \(\sum_r \delta_h^{(r)} = 0\) and
\(\sum_r \delta_h^{(r)} \delta_{h'}^{(r)} = 0\) (h ≠ h'). A minimal set of balanced
half-samples can be obtained from a Hadamard matrix of order
R (L + 1 ≤ R ≤ L + 4).
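
The balance conditions can be illustrated with a small Python sketch. It builds a Hadamard matrix by the Sylvester doubling construction, which only yields orders that are powers of two, so the number of replicates may exceed the minimal order mentioned above; that restriction is a simplification made here, and the function names are hypothetical.

```python
import numpy as np

def sylvester_hadamard(order):
    """Hadamard matrix of the smallest power-of-two size >= order."""
    H = np.array([[1]])
    while H.shape[0] < order:
        H = np.block([[H, H], [H, -H]])
    return H

def balanced_half_samples(L):
    """Return an R x L matrix of +1/-1 group indicators, one row per replicate."""
    H = sylvester_hadamard(L + 1)
    # Skip the all-ones first column so that each stratum column sums to zero.
    return H[:, 1:L + 1]

delta = balanced_half_samples(5)   # 8 replicates for 5 strata
R = delta.shape[0]
```

Each column of `delta` sums to zero and distinct columns are orthogonal, which are exactly the two balance conditions.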

The estimator of the distribution function based on the r-th
half-sample is

\[ \hat{F}^{(r)}(y) = \hat{G}^{(r)}(y) / \hat{N}^{(r)}, \]

where

\[ \hat{G}^{(r)}(y) = \sum_h \sum_c \lambda_{hc}^{(r)} \sum_i w_{hci}\, I\{y_{hci} \le y\},
   \qquad \hat{N}^{(r)} = \hat{G}^{(r)}(\infty), \]

and \(\lambda_{hc}^{(r)}\) is the weight modifier and is constant for all clusters
in the same half-sample. We assume that the weights of all
units (households) in a cluster are rescaled equally by the
modifier \(\lambda_{hc}^{(r)}\).

The standard GBHS method, when \(n_h\) is even, uses

\[ \lambda_{hc}^{(r)} = \begin{cases}
1 + \delta_h^{(r)}, & c \in h_1, \\
1 - \delta_h^{(r)}, & c \in h_2,
\end{cases} \qquad (3.3) \]

which means that the weights are modified either by 2 or 0
depending on whether a unit is in the replicate or not. When \(n_h\)
is odd, a number of different modifications have been
considered (see Shao 1993 and Sitter 1993).

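The method that we are using is based on the standard 
balanced replication resampling plan and a variant of the 
rescaling method proposed by Shao (1993): 

$$A_{hc}^{(r)} = \begin{cases} 1 + (1 - a_h)\delta_h^{(r)}, & c \in h_1; \\ 1 - (1 - b_h)\delta_h^{(r)}, & c \in h_2. \end{cases}$$

The maintenance of the stratum sample size in any of the 
half-sample replicates means that 

$$\sum_{c \in h_1} \bigl(1 + (1 - a_h)\delta_h^{(r)}\bigr) + \sum_{c \in h_2} \bigl(1 - (1 - b_h)\delta_h^{(r)}\bigr) = n_h,$$

which results in 

$$A_{hc}^{(r)} = \begin{cases} 1 + (1 - a_h)\delta_h^{(r)}, & c \in h_1; \\ 1 - \dfrac{m_{h1}}{m_{h2}}(1 - a_h)\delta_h^{(r)}, & c \in h_2. \end{cases} \qquad (3.4)$$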
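
A small numerical check of the rescaled modifiers may help. The sketch assumes that (3.4) reads $1 + (1 - a_h)\delta_h^{(r)}$ for clusters in $h_1$ and $1 - (m_{h1}/m_{h2})(1 - a_h)\delta_h^{(r)}$ for clusters in $h_2$, with $m_{h1} = [n_h/2]$. The function name and argument names are hypothetical.

```python
def gbhs_modifiers(n_h, delta, one_minus_a=1.0):
    """Weight modifiers for the two groups of a stratum under the assumed (3.4).

    n_h         : number of sampled PSUs in the stratum
    delta       : +1 or -1, the group indicator for this replicate
    one_minus_a : perturbation factor 1 - a_h, between 0 and 1
    Returns (m1, m2, A1, A2).
    """
    m1 = n_h // 2
    m2 = n_h - m1
    A1 = 1 + one_minus_a * delta              # clusters in h_1
    A2 = 1 - (m1 / m2) * one_minus_a * delta  # clusters in h_2
    return m1, m2, A1, A2


def modified_sample_size(n_h, delta, one_minus_a=1.0):
    """Sum of the modifiers over the stratum; should equal n_h."""
    m1, m2, A1, A2 = gbhs_modifiers(n_h, delta, one_minus_a)
    return m1 * A1 + m2 * A2
```

With an even $n_h$ and a perturbation factor of 1 the modifiers are 2 and 0, as in the standard method. With an odd $n_h$, a factor of 1 and $\delta_h^{(r)} = -1$, the modifier for $h_1$ is 0, while a factor below 1 keeps it positive.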
To ensure the non-negativity of the modified weights, $a_h$ 
should satisfy $0 \le a_h \le 1$. When $n_h$ is even we would like 
(3.4) to reduce to (3.3). Following Shao’s idea (1993), we 
want the GBHS variance estimator to agree with a consistent 
estimator of the variance in the case of linear statistics. This 
leads to the following requirements for the stratum-specific 
perturbation factors $1 - a_h$: 

For all h: (i) 0< 1 - a, < 1; Gi) (1- a,)°(m,/m, = 1; 
Gina aa: m, /m, = 1. For the even n,’s we simply let 
$a_h = 0$. However, keeping $1 - a_h = 1$ for odd $n_h$’s would 
exclude any contribution from the clusters in the first half-sample 
when $\delta_h^{(r)} = -1$, see equation (3.4). For the purpose 
of the simulation study we chose 


(3.5) 


which reduces to 1 for an even $n_h$. In the case of an odd 
stratum sample size it is equal to $\sqrt{1 - 1/(n_h + 1)}$. In our 
simulation study very few strata have an odd $n_h$ and we 
obtain $v_{GB1}(\hat\mu_y) \approx v_{GB2}(\hat\mu_y) \approx v_L(\hat\mu_y)$, where $\hat\mu_y$ is the sample 
mean and $v_L(\hat\mu_y)$ is the commonly used linearization variance 
estimator. However, it is felt that more research is needed into 
modifying the GBHS method to handle many strata 
containing an odd number of PSU’s. 

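As in the case of the jackknife method, the estimate of the 
income inequality measure computed from the r-th half-sample 
is $\hat\theta^{(r)} = \sum_s J(\hat F^{(r)}, y_{hci}, \hat\beta^{(r)})\, w^{(r)}_{hci}$, where $\hat\beta^{(r)}$ is an 
estimate of the nuisance parameter based on the r-th half-sample 
and $w^{(r)}_{hci} = w_{hci} A_{hc}^{(r)}$. The resulting GBHS variance 
estimator of $\hat\theta$ is 

$$v_{GB1}(\hat\theta) = \frac{1}{R} \sum_{r=1}^{R} (\hat\theta^{(r)} - \hat\theta)^2. \qquad (3.6)$$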


By repeating the random grouping of units within each 
stratum $T$ times, computing $v_{GB1}(\hat\theta)$ each time and averaging 
over the $T$ repetitions we obtain the Repeatedly Grouped 
Balanced Half Sample (RGBHS) variance estimator 

$$v_{RG1}(\hat\theta) = \frac{1}{T} \sum_{t=1}^{T} v_{GB1}^{(t)}(\hat\theta).$$

A variant of the GBHS estimator (and RGBHS) is 
obtained by replacing $\hat\theta$ by $\bar\theta = \sum_r \hat\theta^{(r)}/R$, and will be denoted 
by $v_{GB2}(\hat\theta)$ (and $v_{RG2}(\hat\theta)$). 

Needless to say, when the weights are calibrated they must 
be properly modified for each GBHS replicate using the 
same balanced half-sample procedure. 
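
Once the replicate estimates are available, the GBHS and RGBHS variance estimates reduce to simple averages. The sketch below takes the replicate estimates as given and assumes the $1/R$ divisor of (3.6). Centring the second RGBHS variant at the replicate mean of each repetition separately is an assumption of this sketch, and the function names are hypothetical.

```python
def gbhs_variance(replicate_estimates, full_estimate=None):
    """Mean squared deviation of the R half-sample estimates.

    Centred at the full-sample estimate when it is supplied, otherwise at
    the mean of the replicates (the second variant).
    """
    R = len(replicate_estimates)
    if full_estimate is None:
        centre = sum(replicate_estimates) / R
    else:
        centre = full_estimate
    return sum((t - centre) ** 2 for t in replicate_estimates) / R


def rgbhs_variance(replicate_sets, full_estimate=None):
    """Average of the GBHS variance over T independent random groupings."""
    per_grouping = [gbhs_variance(reps, full_estimate) for reps in replicate_sets]
    return sum(per_grouping) / len(per_grouping)
```

Because the mean minimizes the sum of squared deviations, the variant centred at the replicate mean never exceeds the one centred at the full-sample estimate for the same set of replicates.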


3.3 Bootstrap Method 


Table 1 
Definition of $u_{hci}$ Variates for the EE Approach 

We also investigated the performance of the bootstrap 
method for variance estimation of different income statistics. 
We adopted the bootstrap resampling scheme for the stratified 
cluster sample as given by Rao, Wu and Yue (1992). Briefly, 
draw a simple random sample of $n_h - 1$ clusters with 
replacement (from the $n_h$ sample clusters) independently in 
each stratum. The bootstrap weight, $w^*_{hci}$, is obtained by 
modifying the original weight $w_{hci}$ as follows: 


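$$w^*_{hci} = w_{hci}\,\frac{n_h}{n_h - 1}\,m^*_{hc},$$

where $m^*_{hc}$ is the number of times the hc-th cluster is selected. 
Note that $\sum_c m^*_{hc} = n_h - 1$. This procedure is repeated 
independently $B$ times; for each bootstrap sample, we 
calculate $\hat\theta^* = \sum_s J(\hat F^*, y_{hci}, \hat\beta^*)\, \tilde w^*_{hci}$, where $\hat\beta^*$ is an estimate 
of the nuisance parameter based on the bootstrap sample and 
$\tilde w^*_{hci} = w^*_{hci} / \sum_s w^*_{hci}$. The bootstrap estimate of the variance of 
$\hat\theta$ is then given by 

$$v_{B1}(\hat\theta) = \frac{1}{B} \sum_{b=1}^{B} (\hat\theta^*_b - \hat\theta)^2.$$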


Another variance estimate is obtained by substituting 6 with 
the mean of bootstrap replicates. 
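
The weight adjustment for one bootstrap replicate can be sketched as follows. The code assumes the usual rescaling for resamples of size $n_h - 1$, in which the original weight is multiplied by $\frac{n_h}{n_h - 1}$ times the number of times the cluster was drawn. The data layout and the function name are hypothetical.

```python
import random
from collections import Counter


def bootstrap_weights(units, rng):
    """One replicate of rescaled bootstrap weights.

    units: list of (stratum, psu, weight) tuples; every stratum needs >= 2 PSUs.
    rng  : a random.Random instance (passed in for reproducibility).
    Assumed rescaling: w* = w * n_h / (n_h - 1) * (times the PSU was drawn).
    """
    clusters = {}
    for h, c, _ in units:
        clusters.setdefault(h, [])
        if c not in clusters[h]:
            clusters[h].append(c)

    multiplicity = {}
    for h, psu_list in clusters.items():
        n_h = len(psu_list)
        # draw n_h - 1 PSUs with replacement, independently in each stratum
        draws = Counter(rng.choice(psu_list) for _ in range(n_h - 1))
        for c in psu_list:
            multiplicity[(h, c)] = draws[c]  # 0 if never drawn

    return [w * len(clusters[h]) / (len(clusters[h]) - 1) * multiplicity[(h, c)]
            for h, c, w in units]
```

Repeating the call $B$ times and recomputing the statistic with each set of weights gives the bootstrap replicates. When all PSUs in a stratum carry the same total weight, the stratum total of the bootstrap weights equals the original total.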


3.4 Linearization via the Estimating Equations 
Approach 


The estimating equations (EE) approach of Binder (Binder 
1991, Binder and Patak 1994, Binder and Kovačević 1995), 
unlike the resampling methods, is not computationally 
intensive. This method, based on linearization, provides 
formulae for asymptotic variances which are easy to program 
despite their complicated appearance. 

Applying the EE methodology as given in Binder and 
Patak (1994), Binder and Kovačević (1995) and Kovačević 
and Binder (1997) one obtains expressions for the 
approximate variance estimators of the studied measures as 

$$v_{EE}(\hat\theta) = \sum_h \frac{n_h}{n_h - 1} \sum_c (u_{hc} - \bar u_h)^2, \qquad (3.7)$$

where $u_{hc} = \sum_i \tilde w_{hci} u_{hci}$, $\bar u_h = \sum_c u_{hc}/n_h$, and $\tilde w_{hci}$ is a 
normalized weight. For more on the EE approach, in 
particular the relationship between the $u_{hci}$ variates and the J 
function, we refer the reader to Binder and Kovačević (1995). 
The $u_{hci}$ variates for the considered measures are given in 
Table 1. 
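
Given the $u_{hci}$ variates, the variance formula is a between-cluster sum of squares within strata. The sketch below assumes that (3.7) has the form $\sum_h \frac{n_h}{n_h - 1} \sum_c (u_{hc} - \bar u_h)^2$ and that the normalized weight is the weight divided by the sum of all weights. The example variate $u_{hci} = y_{hci} - \hat\mu$ for a weighted mean is the standard ratio-estimator linearization and is used only for illustration; it is not taken from Table 1. The function name is hypothetical.

```python
from collections import defaultdict


def ee_variance(units, u_values):
    """Linearization variance from u variates under the assumed form of (3.7).

    units   : list of (stratum, psu, weight) tuples; each stratum needs >= 2 PSUs
    u_values: the u variate of each unit, in the same order
    """
    total_weight = sum(w for _, _, w in units)

    # cluster totals u_hc of the normalized-weight-times-u products
    u_hc = defaultdict(float)
    for (h, c, w), u in zip(units, u_values):
        u_hc[(h, c)] += (w / total_weight) * u

    by_stratum = defaultdict(list)
    for (h, _), value in u_hc.items():
        by_stratum[h].append(value)

    variance = 0.0
    for values in by_stratum.values():
        n_h = len(values)
        mean = sum(values) / n_h
        variance += n_h / (n_h - 1) * sum((x - mean) ** 2 for x in values)
    return variance
```

For two equally weighted single-unit clusters with incomes 1 and 3 in one stratum, the variates are -1 and 1 and the function returns 1.0, which matches the usual with-replacement variance estimate $s^2/n$ of the mean.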

The expressions for the $u_{hci}$ variates for the low income 
proportion and polarization index depend on the estimate of 
the density function at the median, $\hat f(\hat\xi_{0.5})$, and at half of the 
median, $\hat f(\hat\xi_{0.5}/2)$. An appropriate method for estimating 
these quantities is given in Binder and Kovačević (1995). 


4. SIMULATION STUDY 


4.1 Data and the Design of the Simulation Study 


The Ontario sample from the 1988 Canadian Survey of 
Consumer Finance (SCF) was used as the underlying 
population of the study. The SCF is an annual supplement to 
the monthly Canadian Labour Force Survey. The population 
contained 7474 households in 525 PSU’s from 40 strata. 
Originally, the Ontario sample was taken from 91 strata which 
we collapsed to form sufficiently large strata. For each 
household a nonnegative value of the total annual income was 
available. The distribution of the income on this micro 
population was highly skewed to the right with coefficients of 
skewness and kurtosis obtained as 4.5 and 89.5, respectively. 
The true values of the parameters of interest (measures of 
income inequality and polarization) were computed from 
this population. Neyman allocation was used to assign 108 
sample clusters (PSU’s) to the 40 strata. We used a one-stage cluster 
design with stratum sample sizes between 2 and 6 clusters, 
selected with probability proportional to size and with 
replacement. In a selected cluster all households 
(6 to 20) were enumerated. 

We considered the following measures in the study: Gini 
Index, Low Income Proportion, Polarization Index, a set of 
Quantile Shares, a set of Lorenz Curve Ordinates and the 
corresponding quantiles. The MSE’s of the estimates of these 
measures were approximated by the empirical mean squared 
error (EMSE), computed over 10,000 independent samples 
drawn by the design explained above. These EMSE’s were 
used as ‘true’ MSE’s for comparison with the estimated 
variances. 

From each of the 10,000 samples, along with the estimates 
of the parameters, we computed estimates of the sampling 
variances using the following methods: the delete-one-PSU 
jackknife (JK), the grouped balanced half-sample (GBHS) 
and the repeatedly grouped balanced half-sample (RGBHS), 
the bootstrap (BS) and the linearization method via estimating 
equations (EE). For all resampling methods two different 
estimators were used, one using the ‘full sample’ estimate and 
another one using the mean over all replicates. The jackknife 
variance estimators were based on 108 jackknife replicates 
while the bootstrap method was based on 100 replicates. The 
GBHS and RGBHS were based on 44 balanced replicates 
obtained from a 44 by 44 Hadamard matrix and 3 repetitions 
for RGBHS, totalling 132 half-sample replicates for this 
method. Note that the number of jackknife replicates is not 
arbitrary: it is determined by the number of clusters in the 
sample. Similarly, the number of GBHS replicates is determined 
by the number of strata. In order to make the number 
of replicates comparable over all methods, we decided to have 
100 (≈ 108) bootstrap replicates and 3 repetitions of the 
GBHS, resulting in 132 replicates for RGBHS. 

In order to evaluate the accuracy and the precision of the 
considered methods we computed their relative biases and 
relative variance (instability) over the $A = 10{,}000$ simulations: 

$$\text{rel. bias}(v_m) = \frac{\sum_a v_m^{(a)}/A - \text{EMSE}}{\text{EMSE}},$$

$$\text{rel. var.}(v_m) = \frac{\sqrt{\sum_a \{v_m^{(a)} - \text{EMSE}\}^2 / A}}{\text{EMSE}}.$$


To evaluate the effectiveness of normal-theory confidence 
intervals, empirical coverage rates were computed for 
nominal confidence coefficients of $100(1 - \alpha)\% = 90$, 95 and 
99 percent, 

$$\text{cov. prob.}(v_m) = \frac{\sum_a I\{\, |\hat\theta_a - \theta| / \sqrt{v_m^{(a)}} \le z_{\alpha/2} \,\}}{A},$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ standard normal percentile. 
Lower and upper tail error rates, err_L($v_m$) and err_U($v_m$), 
were also calculated. 

The large set of results obtained from the simulation study 
is summarized separately for each income inequality 
measure. 
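
The evaluation statistics can be computed from the simulated point estimates and variance estimates as sketched below. The code assumes that the relative variance is the root mean squared deviation of the variance estimates from the EMSE, divided by the EMSE, and it leaves out the tail error rates. The function name and argument names are hypothetical.

```python
import math


def evaluation_stats(estimates, variances, theta, z=1.96):
    """Return (EMSE, relative bias, relative variation, coverage rate).

    estimates: point estimates from the A simulated samples
    variances: the matching variance estimates from one method
    theta    : the true parameter value
    z        : normal critical value (1.96 for a nominal 95% interval)
    """
    A = len(estimates)
    emse = sum((e - theta) ** 2 for e in estimates) / A
    rel_bias = (sum(variances) / A - emse) / emse
    rel_var = math.sqrt(sum((v - emse) ** 2 for v in variances) / A) / emse
    covered = sum(1 for e, v in zip(estimates, variances)
                  if abs(e - theta) / math.sqrt(v) <= z)
    return emse, rel_bias, rel_var, covered / A
```

The first three values are ratios; multiplying the relative bias and the relative variation by 100 expresses them in percent.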


4.2 Summary of Findings 


Gini Index 


Concerning the accuracy of the variance estimators for the 
Gini index, all methods performed similarly, with very small 
negative relative biases ranging between -2.2 and -0.6 
percent. Of all the estimators, the RGBHS estimators had the 
smallest relative bias. 

All estimators were found to be of approximately the same 
stability, in the range of 87-99%. The grouped balanced half- 
sample methods (GBHS and RGBHS) perform slightly worse 
than other methods. 

The coverage probabilities for the 95% confidence 
intervals were in the range of 92.6 (for GBHS) to 93.9 (for 
RGBHS). The realized lower tail error rates exceeded the 
nominal 2.5% rate for all methods considered: they ranged 
between 4.6 and 5.4%, roughly twice the nominal rate. 
The realized upper tail error rates fell below the nominal 
rate for all methods (see Table 2). We also computed the 
coverage rates for the 90% and 99% confidence intervals; 
they were in the range of 87.2 (for GBHS) to 88.5 (for RGBHS) 
and 97.7 (for GBHS) to 98.5 (for RGBHS), respectively. 
Similarly, the tail error rates for the nominal 5% and 1% followed 
the pattern observed for 2.5%. 

Overall, for variance estimation of the Gini index it is 
difficult to say which method is the best since all compared 
methods performed similarly. There is a slight trade off 
between accuracy and stability in the case of the balanced 
half-sample methods which give the most accurate estimates 
of the variance but at the same time the least stable. The 
empirical coverage probabilities for all of the estimators are 
also very similar. The realized values of the tail error rates 
suggest that the use of asymmetric confidence intervals is 
more appropriate. 


Low Income Proportion (LIP) 


All methods considered tended to overestimate the 
variance of the LIP. However, the difference in the magnitude 
of overestimation was large, and ranged between 1.1% for the 
EE and 76.9% for the JK1. The best performer among 
resampling methods was the bootstrap, where the relative bias 
for the BS1 estimator was 8.9% and for BS2 3.8%. 

The jackknife estimate of the variance of the LIP was very 
unstable. The GBHS estimators also had increased instabili- 
ty. The bootstrap and EE estimators performed similarly with 
relative variation between 31 and 45%. 
Table 2 
Values of the Evaluation Statistics for the Variance Estimators of the Gini Index 

                               Jackknife       GBHS            RGBHS           Bootstrap      EE 
                               v_J1   v_J2     v_GB1  v_GB2    v_RG1  v_RG2    v_B1   v_B2    v_EE 
Relative Bias (%)              -1.3   -1.3     -0.9   -1.1     -0.6   -0.7     -1.2   -2.2    -1.5 
Relative Variation (%)         87.1   87.1     99.4   99.2     [?]    [?]      88.5   87.6    87.0 
Coverage Probability (95%)     93.8   93.8     92.6   92.6     93.9   93.9     93.5   93.4    93.7 
Tail Error Rates (2.5%)   L     4.8    4.8      5.4   [?]      [?]    [?]       5.0    5.1     4.9 
                          U     1.4    1.4      2.0    2.0      1.5    1.5     [?]     1.5     1.4 

Table 3 
Values of the Evaluation Statistics for the Variance Estimators of the Low Income Proportion 

                               Jackknife       GBHS            RGBHS           Bootstrap      EE 
                               v_J1   v_J2     v_GB1  v_GB2    v_RG1  v_RG2    v_B1   v_B2    v_EE 
Relative Bias (%)              76.9   58.4     [?]    [?]      [?]    [?]       8.9    3.8     1.1 
Relative Stability (%)        113.1   81.0     62.5   61.0     40.8   39.5     35.1   [?]     31.0 
Coverage Probability (95%)     97.4   96.9     94.6   94.1     96.2   95.7     93.9   93.3    93.2 
Tail Error Rates (2.5%)   L    [?]    [?]       2.6   [?]       2.4    2.6      4.6   [?]      5.0 
                          U     0.5    0.6      2.0    2.4      1.4   [?]      [?]     1.7    [?] 


The 95% confidence intervals for the LIP based on the JK 
variance estimates had higher than nominal coverage rates, 
97.4 and 96.9%, a consequence of the overestimation of the 
variance. Most of the other methods had slightly lower coverage 
rates than nominal. The tail error rates showed that all methods 
resulted in heavier lower tails, indicating a skewed distribution 
of the LIP with a long tail to the right. For the 
90% and 99% confidence intervals we obtained the 
same pattern for the coverage and the tail error rates. 


Overall, for variance estimation of the LIP, the bootstrap 
and the EE method outperform the other methods 
considered. 


Polarization Index 


The evaluation statistics for the variance estimators of the 
polarization index showed a high level of agreement in 
performance with variance estimation for the low income pro- 
portion. Again, the bootstrap and EE method were the best. 


Table 4 
Values of the Evaluation Statistics for the Variance Estimators of the Polarization Index 

                               Jackknife       GBHS            RGBHS           Bootstrap      EE 
                               v_J1   v_J2     v_GB1  v_GB2    v_RG1  v_RG2    v_B1   v_B2    v_EE 
Relative Bias (%)              95.4   56.5     13.9   11.2     14.7   12.1      6.0    2.9     4.2 
Relative Stability (%)        138.7   78.5     77.5   53.9     60.0   58.6     48.4   47.0    50.0 
Coverage Probability (95%)     98.6   98.0     94.2   93.8     95.4   95.2     95.0   94.7    94.4 
Tail Error Rates (2.5%)   L     0.7    0.8     [?]     2.4      1.4    1.4      1.8    2.0     2.0 
                          U     0.8    1.1      3.6    3.9      3.2    3.4      3.2    3.4     3.6 
Lorenz Curve Ordinates and Quantile Shares 


The full results for the Lorenz Curve Ordinates and 
Quantile Shares are given in Kovačević, Yung and Pandher 
(1995). We present here a graphical summary of the results 
in Figures 1a-1c. The jackknife method (both estimators) 
significantly overestimates the variances of all considered 
Lorenz Curve Ordinates (LCO) and Quantile Shares (QS). 
The relative bias of the JK1 estimator for the LCO ranged 
between 15 and 45% and between 9 and 27% for the JK2 
estimator. The relative bias was smaller in the middle of the 
interval (0 < p< 1) and almost three times larger at the tails 
(for small and large values of p). The relative bias of the JK1 
estimator was about 50% larger than the relative bias of the 
JK2 estimator for the LCO. The difference can be attributed 
to the significant difference between the full sample estimate 
of the LCO and the average taken over jackknife replicates. 

Similar findings held for the performance of the JK 
variance estimators for QS’s which overestimated the 
variance between 26-237%, depending on the population 
share. The largest overestimation appeared in the middle. 
Again, the JK1 was larger than JK2 by about 75%. 

The magnitude of the relative bias was very small for the 
other two methods. However, there was no clear pattern 
about the direction of bias — sometimes it was positive, but 
often it was negative. The bootstrap estimators and the EE 
estimator outperformed the other methods, especially around 
the LCO corresponding to p = 0.5 (see Figure 2a). For clarity 
of the graphical presentation the JK methods are not shown in 
Figures 2a and 2b. 

The variance of the QS’s is estimated similarly. The 
bootstrap and EE provided the most accurate estimates of the 
variances of LCO and QS. For the LCO the relative bias 
ranged between -2 and +3% for bootstrap and -5 to +1% 
for EE. At the same time, for the QS, the bootstrap estimates 
had relative biases between -3 and + 8% and EE estimates 
between -3 and +5%. 

Concerning the stability of the different variance 
estimators we found that all methods perform similarly with 
a slight advantage for the EE method. Also, there is an 
obvious direct dependence of the relative variation measure 
and the value of p. 

When we compared the methods according to the coverage 
properties of the variance estimators for the LCO and QS we 
found that for the nominal 95% confidence interval, the JK 
method gave empirical coverage rates between 94.5 and 
96.5% for the LCO and 94.5 to 99% for the QS. Other 
methods performed similarly with coverage rates between 88 
and 94%. Better coverage was found for the LCO and QS 
with smaller values of p (see Figure 1c). In contrast to 
findings for the Gini index, the lower tail error rates were 
about twice the upper tail error rates for all methods and for 
both LCO and QS. A similar pattern was observed for 90% 
and 99% confidence intervals. 

Our empirical findings suggest that the jackknife method 
is not a good choice for the variance estimation of the LCO 
and QS, especially for small and large values of p. Much 
better alternatives are the GBHS or the RGBHS. However, 
the best choice is either the EE method or the bootstrap. 


Figure 1. Properties of the Variance Estimators of Lorenz Curve Ordinates: a) Relative Bias (%); b) Relative Variation (%); c) Coverage Rate (%) (for nominal 95%) 

Figure 2. Properties of the Variance Estimators of Quantile Shares: c) Coverage Rate (for nominal 95%) 


Quantiles 


The full results obtained for the quantiles are presented in 
Kovačević, Yung and Pandher (1995) and are summarized 
graphically here. The relative bias of the JK1 estimate of the 
variance for the quantiles was between 23 and 67% and for 
JK2 between 17 and 52%. The largest overestimation 
occurred for the variances of Ens and Ease, The RGBHS 
and GBHS show quite a different picture. The variance of the 


median was overestimated by 27% but the variances of tail 
quantiles were obtained very accurately, with the relative bias 
between 3 and 7%. Other methods also performed much 
better for the tail quantiles and moderately better for the 
median and quantiles around it. In particular, the bootstrap 
and the EE method produced estimates with the smallest 
relative biases, although without clear pattern about the 
direction of the bias. For the bootstrap estimators, the relative 
bias was in the interval (-5%,+9%), and for EE (-8%, +9%) 
(see Figure 3a). 


Figure 3. Properties of the Variance Estimators of Quantiles: a) Relative Bias (%); b) Relative Variation (%); c) Coverage Rate (%) (for nominal 95%) 
Table 5 
Rankings of methods by relative bias, relative stability and empirical coverage probability 

                      Jackknife   GBHS     RGBHS    Bootstrap   EE (Taylor)   Best methods 
Gini Index            All procedures performed similarly 
Quantiles             [?]         3,4,4    4,3,1    [?]         [?]           EE, BS 
Lorenz Curve          [?]         3,4,4    4,3,1    1,2,3       [?]           EE, BS 
Quantile Shares       [?]         3,4,4    4,3,1    [?]         [?]           BS, EE 
Low Income            [?]         [?]      [?]      [?]         [?]           EE, BS 
Polarization Index    [?]         [?]      4,3,2    [?]         [?]           BS, EE 

The jackknife estimators were the least stable. The 
RGBHS, bootstrap and EE showed similar stability which, on 
average over all quantiles, was about three times higher than 
the stability of JK estimators. The highest stability was 
attained around the median (see Figure 3b). 

In general, the coverage probabilities for the quantiles 
were less than nominal for all of the methods considered, with 
some exceptions for the GBHS and RGBHS methods (see 
Figure 3c). The observed tail error rates behaved similarly 
across methods: for the lower quantiles (p = 0.1, 0.2) the 
upper (right) tails were heavier, whereas for the other 
quantiles the lower tails were heavier. Similar results were 
obtained for the 90% and 99% confidence intervals. 

The findings from this empirical study confirm that for 
variance estimation of quantiles, the jackknife method should 
be avoided. For the variance of the median, in particular, the 
best choice seems to be either the EE or the bootstrap. For 
other quantiles the RGBHS showed very good performance 
as well. 

We condense our findings in Table 5 where the relative 
bias, relative variation and the coverage probabilities for the 
methods considered were ranked from 1 to 5 (1 = the best). 
For the resampling methods we averaged the values over both 
estimators. For the quantiles, LCO and QS we averaged the 
values over all p’s. The last column contains the choice of 
the two best performing methods. 


5. DISCUSSION AND CONCLUSION 


The linearization method via EE has shown the best overall 
performance, the smallest relative bias, the smallest relative 
variation and relatively good coverage properties. Next to the 
EE method is the bootstrap method, as the best resampling 
method considered. The RGBHS and GBHS method 
performed comparably well for the Lorenz Curve ordinates, 
quantile shares and some of the quantiles, in the sense of the 
small relative bias and relative stability comparable with the 
bootstrap method. The jackknife method has performed 
poorly for all measures except the Gini index. 

It is well known that the jackknife variance estimator 
performs poorly for non-smooth functions. The smoothness 
of the J function defined in (3.1) is an essential determinant 
of the asymptotic properties of its variance estimator. 
Classifying our measures as smooth or non-smooth on the 
basis of the J functions, we see that the only smooth estimator 
considered here was the Gini index. Not surprisingly, 
the Gini index was the only measure for which the jackknife 
performed well. However, when considering the jackknife 
variance estimator, care must be taken to ensure that the 
assumptions under which the jackknife is valid are fulfilled. 

If the goal is to provide one method for variance estimation 
for the large list of different income statistics, our empirical 
study has shown that the bootstrap is the best resampling 
choice, and that the linearization via the estimating equations 
approach is the best computationally non-intensive method, 
which, however, requires some preparatory algebraic work, 
different for each measure. 

It should be emphasized that the empirical study was based 
on a one-stage cluster sampling design, with the clusters 
selected proportionally to their size, so the intracluster 
variability was not accounted for. Some other limited studies 
have shown similar behaviour of these methods in the case of 
two-stage sampling plans (see Binder and Kovačević 1995, 
and Kovačević and Binder 1997). 
Instrumental Variable Estimation of Gross Flows in the Presence 
of Measurement Error 


K. HUMPHREYS and C. J. SKINNER¹ 


ABSTRACT 


The problem of estimating transition rates from longitudinal survey data in the presence of misclassification error is 
considered. Approaches which use external information on misclassification rates are reviewed, together with alternative 
models for measurement error. We define categorical instrumental variables and propose methods for the identification and 
estimation of models including such variables by viewing the model as a restricted latent class model. The numerical 
properties of the implied instrumental variable estimators of flow rates are studied using data from the Panel Study of 
Income Dynamics. 


KEY WORDS: Latent class; Longitudinal; Misclassification; Transition rate. 


1. INTRODUCTION 


One of the major benefits of longitudinal surveys is that 
they permit the estimation of gross flows, for example flows 
out of unemployment into employment (see e.g., Hogue and 
Flaim 1986). A key problem when estimating flows is the 
bias induced by measurement error. For the estimation of 
cross-sectional proportions, misclassification into and out of 
states may tend to cancel out (Chua and Fuller 1987). Such 
compensation tends not to occur, however, when estimating 
longitudinal flows. 

The first response to the problem of measurement error 
should clearly be to attempt to reduce the error in the survey 
measurement procedures. Relevant approaches are discussed 
by Biemer, Groves, Lyberg, Mathiowetz and Sudman (1991), 
but will not be considered here. Even with the “best” survey 
procedures, however, some measurement error will inevitably 
arise and there will remain a need to compensate for the effect 
of error in the survey analysis. 

Methods for compensating for measurement error are 
generally based on some assumed model of the error process. 
Some models which have been proposed in the literature will 
be referred to in Section 2. In order to identify and estimate 
these models it is generally necessary to use additional 
auxiliary information, such as provided by reinterview studies 
(e.g., Meyer 1988). Since reinterview studies are costly, 
however, and since in practice their aim is often not to 
estimate the characteristics of the measurement error 
distribution (Forsman and Schreiner 1991), there remains a 
need for alternative procedures which may be used when no 
reinterview data is available. For measurement error on 
continuous variables, a common approach employed in the 
absence of auxiliary information about the measurement error 
distribution is the method of instrumental variable estimation 
(e.g., Fuller 1987, Sect. 1.4). An instrumental variable is a 
variable included in the survey dataset which is related to the 


true variable measured with error but is uncorrelated with the 
measurement error. These and associated assumptions supply 
information which replaces that provided by reinterview 
studies and enables parameters of the model involving the 
true variable to be identified and estimated. The aim of this 
paper is to investigate how the instrumental variable 
estimation method may be adapted to estimate flows among 
discrete states. We find that latent class models (e.g., 
Bartholomew 1987, Ch. 2) provide a general framework 
within which the assumptions about the instrumental variable 
correspond to certain restrictions on the model parameters. 
Our approach is thus related to other approaches which 
impose restrictions on latent class models (e.g., van de Pol 
and de Leeuw 1986; van de Pol and Langeheine 1990). 


2. MODELS 


We consider only the case of two occasions t = 1 and 
t = 2. Let the number of states into which each individual can 
be classified at each occasion be r. Denote the classified 
states at t = 1 and t = 2 by X and Y respectively and the 
corresponding true states by x and y. We assume a model in 
which the vectors of values of (X, Y, x, y) are generated as 
independent outcomes of a common random vector with 
distribution pr(X = i, Y = j, x = u, y = v). 

The first assumption about this distribution, made by a 
number of authors (e.g., Abowd and Zellner 1985; Poterba 
and Summers 1986 and Chua and Fuller 1987) and which we 
shall also make, is that the classification errors on the two 
occasions are conditionally independent given the true states, 
that is 


$$\mathrm{pr}(X=i, Y=j \mid x=u, y=v) = \mathrm{pr}(X=i \mid x=u, y=v)\,\mathrm{pr}(Y=j \mid x=u, y=v). \qquad (\mathrm{A1})$$

Such an assumption is common in general latent variable 
models (e.g., Anderson 1959). It seems a reasonable initial 
assumption when the survey measurement procedures are 
independent on the two occasions. On the other hand, if X is 
obtained retrospectively from the same interview in which Y 
is measured then it seems likely that the tendency for 
respondents to give over-consistent responses in a single 
interview may tend to induce positive association between 
classification errors. See, for example, Marquis and Moore 
(1990) on evidence from the Survey of Income and Program 
Participation. A further reason for doubting the conditional 
independence assumption is the possibility of individual 
heterogeneity in misclassification probabilities, for example 
some respondents may be more reliable than others. See 
Skinner and Torelli (1993) and Singh and Rao (1995). In 
Section 4 we shall allow for heterogeneity by assuming only 
that the model holds within cells of a cross-classification of 
observed variables. 

Our next basic assumption is that classification error only 
depends on current true state so that 

$$\mathrm{pr}(X=i \mid x=u, y=v) = \mathrm{pr}(X=i \mid x=u) = K_{Xiu}, \text{ say},$$

$$\mathrm{pr}(Y=j \mid x=u, y=v) = \mathrm{pr}(Y=j \mid y=v) = K_{Yjv}, \text{ say}. \qquad (\mathrm{A2})$$

The $K_{Xiu}$ and $K_{Yjv}$ define $r \times r$ misclassification matrices 
$K_X = [K_{Xiu}]$ and $K_Y = [K_{Yjv}]$. Letting $P$ denote the $r \times r$ matrix 
with ij-th element pr(X = i, Y = j) and $\Pi$ the $r \times r$ matrix with 
uv-th element pr(x = u, y = v) we have the matrix equation 

$$P = K_X \Pi K_Y'. \qquad (1)$$


The matrix $\Pi$ contains the parameters of interest, whereas 
it is the matrix $P$ which may be estimated consistently from 
sample X and Y values. If auxiliary estimates of $K_X$ and $K_Y$ 
are available and these are non-singular then we can solve 
equation (1) to obtain estimates of $\Pi$. If it is possible to 
ascertain the true states in reinterview studies then $K_X$ and $K_Y$ 
may be estimated directly (Abowd and Zellner 1985). On the 
other hand, if the reinterview study only provides independent 
reclassifications then it is only possible to estimate the 
interview-reinterview matrices 

$$K_X \Delta_x K_X' \quad \text{and} \quad K_Y \Delta_y K_Y',$$

where $\Delta_x = \mathrm{diag}[\mathrm{pr}(x=u)]$, $\Delta_y = \mathrm{diag}[\mathrm{pr}(y=v)]$ (Chua and 
Fuller 1987). Each interview-reinterview matrix is symmetric 
with elements summing to one and so only contains 
$r(r+1)/2 - 1$ “independent” items of information. Since each 
column of each K matrix and the diagonal of each $\Delta$ matrix 
sum to one, the number of unknown parameters on each 
occasion is $r(r-1) + r - 1 = r^2 - 1$. The excess of parameters 
over items of information is therefore $r^2 - 1 - r(r+1)/2 + 1 = 
r(r-1)/2$ at each occasion and so the model is 
underidentified for $r \ge 2$. Chua and Fuller (1987) suggest 
that a natural extra assumption to make to help achieve 
identification is to suppose that the measurement errors are 
unbiased on each occasion in the sense that 

$$\mathrm{pr}(x=i) = \mathrm{pr}(X=i), \quad \mathrm{pr}(y=i) = \mathrm{pr}(Y=i), \quad i = 1,\ldots,r. \qquad (2)$$

In this case false positives and false negatives tend to 
compensate for each other in cross-sectional estimates of 
proportions. This assumption reduces the number of 
parameters by $r - 1$ on each occasion. Even under this 
assumption the model remains underidentified for $r \ge 3$ and 
Chua and Fuller (1987) have to introduce further 
assumptions. 
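
To make the role of equation (1) concrete, the following sketch recovers the true flow matrix from an observed one when the misclassification matrices are known, for the two-state case, and also evaluates the parameter excess discussed above. It assumes that equation (1) reads $P = K_X \Pi K_Y'$ with the columns of each K matrix summing to one. The numerical values in the usage note and all function names are hypothetical.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def true_flows(P, K_x, K_y):
    """Solve P = K_x * Pi * K_y' for Pi in the two-state case."""
    return matmul(matmul(inverse_2x2(K_x), P), inverse_2x2(transpose(K_y)))


def excess_parameters(r, unbiased=False):
    """Parameters minus items of information per occasion (reinterview case).

    r(r-1)/2 in general; the unbiasedness assumption removes r - 1 parameters.
    """
    excess = r * (r - 1) // 2
    return excess - (r - 1) if unbiased else excess
```

With true flows `[[0.5, 0.1], [0.1, 0.3]]` and `K = [[0.9, 0.2], [0.1, 0.8]]` on both occasions, the observed off-diagonal cells are about 0.167 instead of 0.1, and `true_flows` recovers the original matrix. `excess_parameters(3, unbiased=True)` returns 1, in line with the model remaining underidentified for three states.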

Let us now consider how the model might be identified 
when no reinterview data is available. For simple linear 
regression with measurement error in the covariate, the 
instrumental variable approach (Fuller 1987, Sect. 1.4) 
assumes the availability of an observed “instrumental” 
variable W, which is correlated with the covariate, but is 
independent of the measurement error and independent of the 
error in the regression equation. We extend this assumption 
to our framework by defining W to be an instrumental 
variable if it is not independent of x and if 


W and (X,Y) are conditionally independent given (x,y), (A3) 
W and y are conditionally independent given x. (A4) 


In general we shall allow W to be a categorical variable 
with an arbitrary number s of categories, although since we 
shall desire W to be closely related to x, we shall usually have 
s = r in practice. One specific possibility is to take W as the 
classified state at time t - 1. This use of a lagged value of a 
“covariate” as an instrumental variable may be traced back to 
the earliest discussions of instrumental variable estimation 
(e.g., Reiersol 1941; Durbin 1954). In this case, assumption 
A4 follows if the true states obey a Markov process and the 
classification errors are conditionally independent, as in A1. 

The model resulting from assumptions (A1)-(A4) may be 
represented by the conditional independence graph in Figure 1. 
Each vertex in the graph represents a variable. Edges between 
pairs of vertices are absent if the corresponding variables are 
conditionally independent given the remaining variables. 


[Graph on the vertices W, X, Y, x and y; image not reproduced]

Figure 1. Conditional Independence Graph of Basic Model 


The model is an example of a restricted latent class model 
(Goodman 1974), where the observed variables X, Y and W 
are conditionally independent given the latent variables x and 
y, that is they are independent within the r^2 latent classes 
defined by the pairs of values of (x,y). There are 
2(r-1)r^2 + (s-1)r^2 + (r^2-1) parameters of this model given 
by the (r-1)r^2 parameters pr(X=i|x=u,y=v), the 
(r-1)r^2 parameters pr(Y=j|x=u,y=v), the (s-1)r^2 
parameters pr(W=k|x=u,y=v) and the r^2-1 free 
parameters pr(x=u, y=v). These parameters are subject to 
the 2r(r-1)^2 restrictions in (A2) and the (s-1)r(r-1) 
restrictions implied by (A4). We first restrict attention to the 
case r = 2. In this case there are 4s + 7 parameters subject to 
2s + 2 restrictions, leaving 2s + 5 free parameters 


$$\{K^X_{2u},\ K^Y_{2u},\ \phi_{ku},\ \theta_u,\ \pi;\ u = 1,2;\ k = 1,\ldots,s-1\},$$


where φ_{ku} = pr(W=k|x=u), θ_u = pr(y=2|x=u), and π = 
pr(x=2). The number of “free” cell probabilities in the 
observed table of X by Y by W is r^2 s - 1, or 4s - 1 when 
r = 2. Hence a necessary condition for identification when 
r = 2 is that 4s - 1 ≥ 2s + 5 or s ≥ 3. Unfortunately, this is not 
a sufficient condition. For let 


$$R_u = \mathrm{pr}(Y = 2 \mid x = u) = \sum_{v=1}^{2} K^Y_{2v}\,\theta_u^{\,v-1}(1-\theta_u)^{2-v}. \tag{3}$$


Then 

$$\mathrm{pr}(X = i, Y = j, W = k) = \sum_{u=1}^{2} K^X_{iu}\,R_u^{\,j-1}(1-R_u)^{2-j}\,\phi_{ku}\,\pi^{u-1}(1-\pi)^{2-u}. \tag{4}$$


Hence the 4s - 1 free cell probabilities are determined by just 
the 2s + 3 parameters 


$$\{K^X_{2u},\ R_u,\ \phi_{ku},\ \pi;\ u = 1,2;\ k = 1,\ldots,s-1\},$$


so a necessary condition for identification of these parameters 
is that 4s - 1 ≥ 2s + 3 or s ≥ 2. In fact this is also a 
sufficient condition for identification of these parameters, 
except for certain exceptional combinations of these 
parameters. (See Madansky (1960) for the case s = 2 and 
Goodman (1974) for the case of general s ≥ 2.) 

However, even though the above 2s + 3 parameters are in 
general identified for s ≥ 2 it is not possible to determine the 
4 parameters K^Y_{21}, K^Y_{22}, θ_1 and θ_2 since they are related to 
only two identified parameters, R_1 and R_2, via equation (3). 
In particular the key parameters of interest θ_1 and θ_2 remain 
underidentified whatever the value of s. 

It is therefore necessary to impose at least 2 further 
restrictions on the model to identify θ_1 and θ_2. Following 
Chua and Fuller (1987), one idea would be to assume 
unbiased measurement errors as in (2) which imposes the two 
constraints 


$$\pi = K^X_{21}(1-\pi) + K^X_{22}\,\pi, \tag{5}$$


$$\theta_1(1-\pi) + \theta_2\,\pi = R_1(1-\pi) + R_2\,\pi. \tag{6}$$


Unfortunately the first constraint only applies to the 
parameters which are already identified for s ≥ 2 so these 
constraints are insufficient to identify θ_1 and θ_2. An 
alternative assumption which we shall make is that the error 
process is constant over time so that 

$$K^X_{iu} = K^Y_{iu} = K_{iu}, \text{ say, for } i, u = 1, 2, \ldots, r. \tag{A5}$$

This seems a natural basic assumption if the same survey 
measurement procedure is used over time. The 
underidentification problem for the case r = 2 discussed above is 
removed by this assumption since, given the identification of 
K^X_{2u} = K^Y_{2u} = K_{2u} and R_u, we can determine θ_u from (3) by 


$$\theta_u = (R_u - K_{21})/(K_{22} - K_{21}) \tag{7}$$


(excluding the trivial case when the measured variables are 
independent of the true variables so that K_{21} = K_{22}). 
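
As a hedged illustration of (7), the sketch below recovers θ_u from R_u and the common misclassification probabilities for the case r = 2. The function name `theta_from_R` is ours, the indexing K_{iu} = pr(X=i|x=u) is an assumption about the notation, and the numerical values are the rounded parameter values used later in Section 4.1.

```python
def theta_from_R(R_u, K21, K22):
    """Solve R_u = K21*(1 - theta_u) + K22*theta_u for theta_u, as in (7)."""
    if K21 == K22:
        # Trivial case: classified state independent of the true state.
        raise ValueError("K21 == K22: theta_u is not identified")
    return (R_u - K21) / (K22 - K21)


# Illustrative values (Section 4.1): K21 = pr(X=2|x=1), K22 = pr(X=2|x=2).
K21, K22 = 0.03, 0.94
theta_1 = 0.03 / 0.78                       # pr(y=2 | x=1)
R_1 = K21 * (1 - theta_1) + K22 * theta_1   # pr(Y=2 | x=1), from (3) with (A5)
recovered = theta_from_R(R_1, K21, K22)
print(theta_1, recovered)
```

With estimated rather than exact inputs, nothing in this formula constrains the result to lie in [0,1], which is one reason for fitting the restricted model directly.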

In summary, when assumptions (A1)-(A5) hold and 
r = 2, our model has 2s + 3 free parameters 
{K_{2u}, φ_{ku}, θ_u, π; u = 1,2; k = 1,...,s-1} which are identified if s ≥ 2, 
except in exceptional cases such as discussed by Madansky 
(1960). 

Finally, let us return to the case of general r. Since (A5) 
imposes (r-1)r restrictions, the number of free parameters 
becomes 2(r-1)r^2 + (s-1)r^2 + (r^2-1) - [2r(r-1)^2 + (s-1)r(r-1)] 
- (r-1)r = 2r^2 + sr - 2r - 1. There are r^2 s - 1 free 
cell probabilities in the table of X by Y by W so the model will 
in general be identified if r(r-1)(s-2) ≥ 0. Thus the 
condition for identification of these parameters remains s ≥ 2, 
for any value of r ≥ 2. Furthermore we can write 


$$R_{ju} = \mathrm{pr}(Y = j \mid x = u) = \sum_{v} K_{jv}\,\theta_{vu},$$


where θ_{vu} = pr(y=v|x=u). Hence, provided the matrix 
[K_{jv}] is non-singular, the θ_{vu} may be determined from the R_{ju} 
and K_{jv} and hence are also identified. Thus for general r, the 
model is identified under assumptions (A1)-(A5), except for 
exceptional cases as discussed by Goodman (1974). 
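
The counting argument for general r and s can be verified with a few lines of code. This is a sketch only: the function name `iv_model_counts` is hypothetical, and the script checks the necessary counting condition, not the exceptional parameter combinations discussed by Madansky (1960) and Goodman (1974).

```python
def iv_model_counts(r, s):
    """Return (free parameters under (A1)-(A5), free cell probabilities)."""
    total = 2 * (r - 1) * r**2 + (s - 1) * r**2 + (r**2 - 1)
    restrictions = (
        2 * r * (r - 1) ** 2       # (A2)
        + (s - 1) * r * (r - 1)    # (A4)
        + (r - 1) * r              # (A5)
    )
    free_params = total - restrictions
    free_cells = r**2 * s - 1
    return free_params, free_cells


for r in (2, 3, 4):
    for s in (2, 3):
        params, cells = iv_model_counts(r, s)
        # The surplus of cells over parameters equals r(r-1)(s-2).
        print(r, s, params, cells, cells - params)
```

For r = s = 2 the two counts coincide at seven, which is the just-identified case considered in the remainder of the paper.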


3. ESTIMATION 


We shall suppose that for a sample of size n we observe 
counts n_{ijk} in the cells of the r × r × s contingency table of 
X × Y × W, and that these are multinomially distributed with 
parameters n and p_{ijk} = pr(X = i, Y = j, W = k). The implied log 
likelihood is 


$$l = \sum_i \sum_j \sum_k n_{ijk} \log p_{ijk}.$$


Under a complex sampling design, we may take the n_{ijk} to 
be weighted counts, giving a pseudo log likelihood (Skinner 
1989). The estimators of the parameters obtained by 
maximising l will be called instrumental variable (IV) 
estimators. 
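
To make the likelihood concrete, the following sketch builds the cell probabilities p_{ijk} implied by assumptions (A1)-(A5) and evaluates the log likelihood. It is an illustration under stated assumptions, not the authors' GAUSS or PANMARK code: the function names, the array layout (index 0 for state 1, index 1 for state 2), the convention K[i,u] = pr(X=i|x=u) and the example value of φ are ours, and the other numbers are the rounded values of Section 4.1.

```python
import numpy as np


def cell_probs(K, theta, phi, pi):
    """p[i,j,k] = sum_{u,v} K[i,u] K[j,v] theta[v,u] phi[k,u] pi[u].

    K[i,u]     = pr(X=i|x=u), also used for pr(Y=j|y=v) under (A5)
    theta[v,u] = pr(y=v|x=u)
    phi[k,u]   = pr(W=k|x=u)
    pi[u]      = pr(x=u)
    """
    return np.einsum("iu,jv,vu,ku,u->ijk", K, K, theta, phi, pi)


def log_lik(counts, p):
    """Multinomial log likelihood; weighted counts give a pseudo log likelihood."""
    counts = np.asarray(counts, dtype=float)
    mask = counts > 0  # empty cells contribute nothing
    return float(np.sum(counts[mask] * np.log(p[mask])))


K = np.array([[0.97, 0.06], [0.03, 0.94]])
pi = np.array([0.78, 0.22])
theta = np.array([[0.75 / 0.78, 0.03 / 0.22],
                  [0.03 / 0.78, 0.19 / 0.22]])
phi = np.array([[0.1, 0.9], [0.9, 0.1]])  # hypothetical instrument

p = cell_probs(K, theta, phi, pi)
print(p.sum(), log_lik(5357 * p, p))
```

An IV estimate would be obtained by maximising `log_lik` over the free parameters with a constrained optimiser; that step is omitted here.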

For the remainder of this paper we shall only consider the 
case r =s =2 when the model is just identified (except for 
exceptional values of the parameters). In this case we might 
attempt to set p_{ijk} = n_{ijk}/n and then solve equations (6) and (7) 
for the unknown parameters. If the resulting solutions lie 
within the feasible parameter space, that is probabilities lie in 
the range [0,1], then these solutions will be the IV estimates. 
However, in practice we have found that, for moderate sample 
sizes, infeasible solutions can often arise. Furthermore the 
solution of these equations is not computationally 
straightforward. Hence we have found it easier to maximise l directly 
using the numerical procedures in the package GAUSS 
(Edlefsen and Jones 1984) or else by using packages which fit 
latent class models using the EM algorithm such as 
PANMARK (van de Pol, Langeheine and de Jong 1991). For 
a latent class package it would be possible to fit an 
unrestricted two class model and then to estimate θ_1 and θ_2 
via (7). However, there would be no guarantee that the 
resulting estimates would lie in the feasible range [0,1] with 
this approach. Furthermore there would be the additional 
complication of determining standard errors for the estimates 
of θ_1 and θ_2 from the covariance matrix of the estimates of 
(R_1, R_2, K_{21}, K_{22}). Hence we have found it more convenient 
to fit the model directly as a restricted latent class model. A 
further advantage of this approach is that it extends naturally 
to the fitting of similar models across subgroups subject to 
possible constraints that some parameters are constant across 
subgroups. This possibility is explored further in Section 4. 

Under multinomial assumptions, standard errors may be 
based on the second derivatives of the log-likelihood 
evaluated at the IV estimates. This approach becomes 
problematic, however, if the maximum of l is at the boundary of 
the parameter space. One approach then is simply to treat the 
values of the parameters at the boundary as known. However, 
this is likely to lead to underestimation of uncertainty. Baker 
and Laird (1988) consider two alternative approaches to 
obtaining interval estimates for individual parameters in such 
circumstances: a bootstrap method and a profile likelihood 
method. The bootstrap method involves drawing repeated 
multinomial samples with p_{ijk} set equal to n_{ijk}/n and 
recording the distribution of parameter estimates across 
repeated bootstrap samples. Interval estimates for given 
parameters are obtained by the profile likelihood method as 
the sets of values of the parameter which are not rejected by 
a likelihood ratio test. These methods are illustrated at the 
end of Section 4. 
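
The bootstrap step described above can be sketched as follows. This is a generic illustration, assuming integer cell counts and a user-supplied `estimator` function; the names `bootstrap_estimates` and `prop_11`, the example counts and the simple proportion used as the estimator are hypothetical stand-ins for the IV fitting routine.

```python
import numpy as np


def bootstrap_estimates(counts, estimator, n_boot=1000, seed=0):
    """Resample multinomial tables with p_ijk = n_ijk / n and re-estimate."""
    counts = np.asarray(counts)
    n = int(counts.sum())
    p_hat = (counts / n).ravel()
    rng = np.random.default_rng(seed)
    draws = rng.multinomial(n, p_hat, size=n_boot)  # shape (n_boot, cells)
    return np.array([estimator(d.reshape(counts.shape)) for d in draws])


def prop_11(table):
    """Stand-in estimator: observed proportion with X = 1 and Y = 1."""
    return table[0, 0, :].sum() / table.sum()


# Hypothetical 2 x 2 x 2 table of counts n_ijk.
counts = np.array([[[300, 410], [30, 25]],
                   [[35, 26], [90, 84]]])
est = bootstrap_estimates(counts, prop_11, n_boot=500, seed=1)
lower, upper = np.percentile(est, [2.5, 97.5])  # percentile interval
print(est.std(ddof=1), lower, upper)
```

Replacing `prop_11` by a function that maximises the log likelihood for each resampled table gives the bootstrap distribution of the IV estimates, including any mass at the boundary.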


4. NUMERICAL ILLUSTRATIONS 


For the purpose of numerical illustration we use data from 
the equal probability subsample of the US Panel Study of 
Income Dynamics (PSID). See Hill (1992). We consider the 
two states employed and not employed, coded 1 and 2 
respectively, thus restricting attention again to the binary 
variable case. For simplicity, we ignore non-response and 
consider the sample of 5,357 individuals aged 18-64 in 1986 
with complete values on the variables: employment status in 
1985, 1986 and 1987, car ownership, age, sex and education. 


We assess the properties of the IV estimator in two ways. 
First, in Section 4.1, we compare the bias and standard error 
of the IV estimator with the “unadjusted” estimator for 
hypothetical instrumental variables, with a range of different 
associations with x. Second, in Section 4.2, we consider the 
impact of using different actual PSID variables as 
instrumental variables. 


4.1. Bias and Standard Error Properties of Estimators 
for Hypothetical Instrumental Variables 


The parameters of primary interest are the joint 
probabilities pr(x=i, y=j) or the conditional probabilities 
pr(y=j|x=i) derived from these. The simple “unadjusted” 
estimators of these parameters are based on the corresponding 
sample proportions for the classified variables X and Y and 
have expectations pr(X=i, Y=j) under multinomial sampling. 
Since pr(X=i, Y=j) differs in general from pr(x=i, y=j) the 
unadjusted estimators are typically biased. Provided the 
model assumptions (A1)-(A5) hold, the IV estimators of 
pr(x=i, y=j) will be asymptotically unbiased although their 
variances may be larger than those of the unadjusted 
estimators. The aim of this section is to investigate the extent 
to which there exists a trade-off in practice between the bias 
of the unadjusted estimators and the increased variance of the 
IV estimators. It will be assumed that the model assumptions 
(A1)-(A5) hold and that the sample is large enough for the IV 
estimator to be treated as unbiased. 

For the numerical investigation in this section we wish to 
use some “realistic” parameter values. These were determined 
by rounding the values of estimates for annual flows between 
the years 1986 and 1987 from analyses in Section 4.2 
(reported in Table 3). The values of the five free model 
parameters not involving W were set to be K_{21} = 0.03, K_{22} = 
0.94, pr(x=2) = π = 0.22, pr(y=2, x=1) = θ_1(1-π) = 0.03 
and pr(y=2, x=2) = θ_2 π = 0.19. Different values of the 
remaining two free parameters φ_{11} = pr(W=1|x=1) and 
φ_{12} = pr(W=1|x=2) are set in the different columns of 
Table 1. Cramér’s V statistic, which measures the association 
between two binary variables, essentially by scaling the 
chi-square statistic to a [0,1] interval, is provided as a summary of 
the strength of association between the variables W and x. 
For each of the choices of parameter values, Table 1 displays 
the estimated standard errors of the IV estimators for the 
PSID sample size n = 5,357. Table 1 also contains the biases 
and standard errors of the unadjusted estimator for the same 
parameter values K_{21}, K_{22}, π, θ_1 and θ_2 and the same sample 
size. 
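
For two binary variables Cramér's V reduces to the absolute value of the phi coefficient, so the V values attached to each hypothetical instrument can be reproduced from pr(W=1|x=1), pr(W=1|x=2) and pr(x=2). The sketch below is ours (the function name `cramers_v_binary` is hypothetical) and uses the population probabilities rather than sample counts.

```python
import math


def cramers_v_binary(phi11, phi12, pi):
    """Cramér's V between binary W and x.

    phi11 = pr(W=1|x=1), phi12 = pr(W=1|x=2), pi = pr(x=2).
    """
    p_x1 = 1.0 - pi
    p_w1 = phi11 * p_x1 + phi12 * pi            # marginal pr(W=1)
    cov = phi11 * p_x1 - p_x1 * p_w1            # pr(W=1,x=1) - pr(W=1)pr(x=1)
    denom = math.sqrt(p_x1 * pi * p_w1 * (1.0 - p_w1))
    return abs(cov) / denom


for phi11, phi12 in [(1.0, 0.0), (0.1, 0.9), (0.1, 0.7), (0.1, 0.5)]:
    print(phi11, phi12, round(cramers_v_binary(phi11, phi12, 0.22), 2))
```

With π = 0.22 the pairs (0.1, 0.9) and (0.1, 0.7) give V of about 0.74 and 0.59 respectively.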

Table 1 
Biases and Standard Errors under Alternative Hypothetical IVs 

                  Unadjusted Estimator      Parameter Values Assumed for IV Estimator 
pr(W=1 | x=1)                               1.0    0.1    0.1    0.1    0.3    0.1    0.5 
pr(W=1 | x=2)                               0.0    0.9    0.7    0.5    0.7    0.3    0.3 
Cramér’s V                                  1.0    0.74   0.59   0.42   0.34   0.24   0.17 

Parameter         Bias       Standard       Standard Errors (x 100) of IV Estimator 
Estimated         (x 100)    Error (x 100) 
pr(x=1, y=1)      -4.0       0.62           0.68   0.75   0.88   1.18   1.16   1.82   2.05 
pr(x=1, y=2)       3.0       0.32           0.39   0.43   0.51   0.64   0.69   1.03   1.24 
pr(x=2, y=1)       3.0       0.32           0.32   0.37   0.44   0.57   0.66   0.95   [?] 
pr(x=2, y=2)      -2.0       0.51           0.59   0.65   0.73   0.89   1.06   1.42   1.99 
pr(y=1 | x=1)     -3.9       0.37           0.50   0.55   0.64   0.81   0.88   1.30   1.58 
pr(y=1 | x=2)     12.4       0.60           [?]    [?]    [?]    2.56   [?]    [?]    [?] 

Note: 1 = employed, 2 = not employed; n = 5,357; multinomial sampling assumed; biases of IV estimators are zero. [?] marks an entry that is illegible in the source. 


To illustrate the calculation of the biases of the unadjusted 
estimators, consider pr(x=1, y=1). The expectation of the 
unadjusted estimator of this parameter is pr(X=1, Y=1), 
which is calculated from the given values of K_{21}, K_{22}, π, θ_1 
and θ_2 and assumptions (A1)-(A5) as 0.71. This compares 
with the assumed value of pr(x=1, y=1) of 0.75. The bias is 
thus 0.71 - 0.75 = -0.04. The biases of the IV estimators are, 
as noted above, assumed to be zero. The standard errors of the 
unadjusted estimators are obtained from standard binomial 
formulae. For example, the standard error of the unadjusted 
estimator of pr(x=1, y=1) is sqrt(0.71 × 0.29/5,357) = 0.0062, 
where 0.71 is the value of pr(X=1, Y=1). The standard 
errors of the IV estimators are obtained from the inverse of 
the expected information matrix, which is given by 
-n Σ_{ijk} p_{ijk} H_{ijk}, where H_{ijk} is the 7 × 7 matrix of second 
derivatives of log p_{ijk} with respect to the seven free 
parameters. Following differentiation, these parameters are set 
equal to their assumed values, as indicated above. Note that 
the standard errors obtained from the multinomial information 
matrix are likely to be under-estimates because of the 
complex sampling design employed in the PSID. 
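
The bias and standard error of the unadjusted estimator of pr(x=1, y=1) can be reproduced as follows. This is a minimal sketch under the assumption that K[(i,u)] denotes pr(X=i|x=u), with the same matrix applying to Y under (A5); the dictionary layout and the name `classified_prob` are ours.

```python
import math

# Assumed misclassification probabilities K[(i, u)] = pr(X=i | x=u).
K = {(1, 1): 0.97, (2, 1): 0.03, (1, 2): 0.06, (2, 2): 0.94}
# Assumed true joint probabilities pr(x=u, y=v).
true = {(1, 1): 0.75, (1, 2): 0.03, (2, 1): 0.03, (2, 2): 0.19}
n = 5357


def classified_prob(i, j):
    """pr(X=i, Y=j) = sum over (u, v) of K[i,u] * K[j,v] * pr(x=u, y=v)."""
    return sum(K[(i, u)] * K[(j, v)] * true[(u, v)]
               for u in (1, 2) for v in (1, 2))


expect = classified_prob(1, 1)                   # expectation of unadjusted estimator
bias = expect - true[(1, 1)]
se = math.sqrt(expect * (1 - expect) / n)        # binomial standard error
print(round(expect, 2), round(bias, 2), round(se, 4))
```

Calling `classified_prob` for the other cells gives the remaining biases of the joint probabilities in the same way.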

There is a clear pattern of the standard errors of the IV 
estimator increasing as the association between W and x 
decreases. The amount of increase is fairly similar across all 
parameters, for example the ratio for V = 0.20 versus V = 1.00 
lies between 3 and 4 for all parameters. In all cases the 
standard error of the IV estimator is greater than that of the 
unadjusted estimator. The loss of efficiency of the “best” IV 
estimator (with perfect association between W and x) 
compared to the unadjusted estimator varies between 
parameters. Roughly speaking, the loss is greater for the conditional 
parameters than for the unconditional parameters. This loss 
of efficiency might be interpreted as the effect of adjusting 
for measurement error in y, which is still necessary even when 
x is perfectly measured by W. Under this interpretation, the 
greater relative loss of efficiency for the conditional 
parameters seems plausible since these are “less dependent” 
on the parameters of the marginal x distribution which the W 
information helps to estimate. 

To examine the trade-off between the bias of the 
unadjusted estimator and the increased variance of the IV 
estimator we have calculated the minimum value of the 
sample size n necessary for the MSE of the IV estimator to be 


less than that of the unadjusted estimator. For complex 
designs the sample sizes should be interpreted as effective 
sample sizes. Table 2 gives these minimum values under a 
variety of strengths of association between W and x. If there 
were no misclassification the entries would all be infinity 
since the unadjusted estimators would always be more 
efficient than the IV estimators. For the assumed amount of 
misclassification given by K_{21} = 0.03 and K_{12} = 0.06, the 
sample size required increases rapidly as V decreases. The 
differences between the rows of Table 2 are partly accounted 
for by the differences between the rows of Table 1 and partly 
by differences between the biases of the unadjusted estimator. 
Thus, the bias of the unadjusted estimator of pr(x=2, y=2) 
is relatively small and this leads to the large values in the 
corresponding row of Table 2. Note that the value of 1 for 
pr(x=2, y=1) and Cramér’s V = 1 arises because in this 
case the standard errors of the two estimators are equal (see 
Table 1) and so the bias of the unadjusted estimators implies 
that the IV estimator has smaller MSE for any n ≥ 1. 
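
The break-even sample size can be sketched as below, assuming (our assumption, consistent with multinomial sampling) that both standard errors scale as 1/sqrt(n) while the bias of the unadjusted estimator does not depend on n. The function name `mse_breakeven_n` is hypothetical. Because the inputs are the rounded entries of Table 1, the results only approximate the entries of Table 2.

```python
def mse_breakeven_n(se_iv, se_unadj, bias, n_ref):
    """Sample size above which the IV estimator has the smaller MSE.

    se_iv and se_unadj are standard errors at the reference sample size n_ref.
    MSE_IV(n) = a_iv / n and MSE_unadj(n) = bias**2 + a_un / n, so the IV
    estimator wins once n > (a_iv - a_un) / bias**2.
    """
    a_iv = n_ref * se_iv**2
    a_un = n_ref * se_unadj**2
    return max(0.0, (a_iv - a_un) / bias**2)


# pr(x=1, y=1) with Cramér's V = 1.0 (rounded Table 1 values).
print(mse_breakeven_n(0.0068, 0.0062, -0.04, 5357))
# pr(x=2, y=1) with Cramér's V = 1.0: equal standard errors.
print(mse_breakeven_n(0.0032, 0.0032, 0.03, 5357))
```

When the two standard errors are equal the threshold is zero, which corresponds to the entry of 1 discussed above.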

The main conclusion we wish to draw from Table 2, 
however, is simply that we may expect there to be a number 
of practical situations where IV estimation will be worth- 
while provided the model assumptions hold, even if the 
necessary sample sizes are inflated somewhat to allow for 
complex sampling designs. 


4.2 Results for Actual Instrumental Variables 


The results in the previous section were based on 
hypothetical instrumental variables. To provide a more 
realistic illustration we now consider possible real 
instrumental variables. The key problem is how to choose a 
variable W which obeys (A3) and (A4). It seems easier to 
find a variable which satisfies (A3) than (A4); in particular, 
variables measured without error obey (A3). However, it seems more 
difficult to find variables which one is sure are not related to 
change in employment status and hence obey (A4). 


Table 2 
Sample Size Necessary for MSE of IV Estimator to be less than that of Unadjusted Estimator 
(Multinomial Sampling) 

Parameter         Value of Cramér's V assumed for IV estimators 
Estimated         1.0    0.74   0.59   0.42   0.34   0.24   0.17 
                  Sample size n required 
pr(x=1, y=1)      28     59     132    300    320    971    [?] 
pr(x=1, y=2)      [?]    50     91     184    219    [?]    843 
pr(x=2, y=1)      1      20     51     129    198    476    811 
pr(x=2, y=2)      112    227    366    720    1184   2397   5070 
pr(y=1 | x=1)     42     60     [?]    183    219    541    818 
pr(y=1 | x=2)     [?]    81     121    216    281    633    1061 

Note: [?] marks an entry that is illegible in the source. 

For illustration, we have considered two possibilities. First 
we have taken W as car ownership (W = 2 if the individual 
owns a car, W = 1 if not). This variable is likely to be 
measured with some error but it seems a reasonable first 
assumption that this error is unrelated to errors in measuring 
employment status. For example, in an analysis of errors in 
recording car ownership in the 1981 British Census, Britton 
and Birch (1985, p. 67) conclude that “the main problems 
associated with the small number of discrepancies were those 
connected with either vehicles out of use or vehicles 
temporarily available — for example, those hired...” and it 
seems at least plausible that such errors need have little 
relation to the kinds of errors in recording employment status. 
On the other hand, it is plausible that car ownership acts as a 
proxy for some kind of social or economic status which is 
related to change in employment status so assumption (A4) 
seems more questionable. However, for our illustrative 
purpose we assume (A3) and (A4) hold. 

As a second illustration we have taken W to be the lagged 
employment status in 1985. A problem here is that (A4) 
effectively implies that individual employment histories 
follow Markov processes with common transition rates. In 
fact, transition rates will vary among individuals and this will 
invalidate assumption (A4) (e.g., van de Pol and Langeheine 
1990). Therefore, to allow for departures from assumption 
(A4), we disaggregated the sample into 16 groups defined by 
cross-classifying age (4 groups), sex and education (up to 
college level or not). We then assumed the model held within 
subgroups and used likelihood ratio tests to assess what 
parameters were constant across subgroups. These tests only 
provide a very rough guide since they ignore the complex 
sampling design of the PSID. There was no significant 
evidence of differences in the misclassification probabilities K_{iu} 
across subgroups. Furthermore, within each of the 8 sub- 
groups defined by age x sex there was no significant 
evidence of differences in Pr(W|x, subgroup) between the 


2 education subgroups. Assuming equality of these 
parameters gave a non-significant likelihood-ratio 
goodness-of-fit chi-squared value of 52.9 on 46 df (46 is obtained as 
the number of cells = 16 × 8 = 128, less 2 K_{iu} parameters, less 
16 × 4 = 64 pr(x, y, subgroup) parameters, less 8 × 2 = 16 
pr(W|x, subgroup) parameters). Combining the parameter 
estimates for the disaggregated model appropriately gives 
estimates of the overall flows pr(x, y). 

Table 3 contains estimates of the key parameters for the 
two choices of instrumental variable and for the disaggregated 
version of the second choice. We note first that the standard 
errors for the IV estimator based on car ownership are 
relatively high. This may be expected from Table 1 since the 
association between x and W is low (Cramér’s V is 0.12). 
Even so, the resulting adjustments increasing the estimates 
for the diagonal entries are plausible and the confidence 
intervals resulting from this IV estimator seem more realistic 
than those for the unadjusted estimator. 


Table 3 
Unadjusted and IV Estimates for PSID Data 

                                 IV Estimates 
Parameter       Unadjusted      Car          Lagged        Lagged Employment 
                Estimates       Ownership    Employment    (Disaggregated) 
pr(x=1, y=1)    0.719           0.773        0.766         0.757 
                (0.006)         (0.033)      (0.008)       (0.007) 
pr(x=1, y=2)    0.055           0.011        0.017         0.025 
                (0.003)         (0.020)      (0.005)       (0.003) 
pr(x=2, y=1)    0.061           0.018        0.024         0.032 
                (0.003)         (0.019)      (0.004)       (0.003) 
pr(x=2, y=2)    0.166           0.198        0.193         0.186 
                (0.005)         (0.027)      (0.007)       (0.006) 

Note: Standard errors under multinomial assumptions in parentheses. 
Disaggregation is by age (4 groups), sex and education (2 groups). 

The standard errors for the second choice of instrumental 
variable are smaller, as expected since the association with x 
is now higher (Cramér's V is 0.73). Indeed these standard 
errors are not much larger than those for the unadjusted 
estimator. The (2 standard error) confidence intervals now do 
not overlap with the corresponding intervals for the 
unadjusted estimator for any of the four parameters. 

As noted earlier, assumption (A4) is questionable for the 
lagged employment variable. The disaggregated version of 
this estimator makes “weaker” assumptions by only requiring 
(A4) to hold within subgroups. The resulting estimates are 
seen to be fairly close to the original IV estimator and to have 
slightly smaller standard errors, perhaps attributable to the use 
of the additional information on sex, age and education (but 
see later discussion). It is interesting that the effect of the 
disaggregation is to diminish the effect of adjustment by a 
relatively small amount in each case. It seems plausible that 
departures from (A4) may tend to lead to overadjustment in 
the IV estimator and that the disaggregation approach here 
helps to overcome this bias and, for alternative choices of 
disaggregating variables, enables an assessment of the 
sensitivity of results to the model specification. 

As noted in Section 3 we have often come across IV 
estimates on the boundary of the interval [0,1]. Of the 
analyses reported in Table 3 in fact only the disaggregated 
analysis involved boundary estimates. For the 64 parameters 
pr(x=i, y=j, subgroup) for i, j = 1,2, subgroup = 1, ..., 16, 
five of the estimates were on the boundary (none of the 
estimates of the remaining 18 parameters, pr(W=1|x=1) 
and so forth, were). The standard errors reported in Table 3 
treat these parameters as known and hence may 
underestimate the uncertainty in the estimates of the aggregate 
pr(x=i, y=j) parameters. 


Table 4 
Alternative Estimates of Standard Errors 
for Males Aged 26-35 with no College Education 

                                  Estimated Standard Error 
Parameter         IV estimates    Standard    Bootstrap 
pr(W=1 | x=1)     0.947           0.011       0.011 
pr(W=1 | x=2)     0.107           0.089       0.091 
pr(X=1 | x=1)     0.969           0.006       0.007 
pr(X=1 | x=2)     0.084           0.088       0.075 
pr(x=1, y=1)      0.953           0.011       0.012 
pr(x=1, y=2)      0               *           * 
pr(x=2, y=1)      0.006           0.007       0.006 
pr(x=2, y=2)      0.041           0.012       0.011 
pr(x=1)           0.953           0.011       0.011 
pr(y=1 | x=1)     1               *           * 
pr(y=1 | x=2)     0.128           0.139       0.117 

Note: n = 455; “standard” estimators based on observed information 
matrix, treating parameters estimated at the boundary 
as known; 10,000 replications of bootstrap; multinomial 
assumptions. 

Table 4 presents alternative estimates of the standard 
errors for one subgroup, males aged 26-35 with no college 
education. The estimate of pr(x =1, y=2) as well as derived 
estimates, such as pr(y=1|x=1) lie on the boundary. The 
“standard” estimates of the standard errors are, as in Table 3, 
based on the observed information matrix, treating parameters 
estimated at the boundary as known. Bootstrap standard error 
estimates (for 10,000 replications) are found to be very close 
to these standard estimates for parameters with estimates not 
on the boundary. For the IV estimate of pr(x = 1, y=2) at the 
boundary no standard estimate of the standard error is 
available. Indeed it seems to make little sense to estimate the 
standard deviation of the sampling distribution in this case. 
It seems more sensible to derive a one-sided confidence 
interval which may be done either using the profile likelihood 
method, which gives [0, .016], or using the bootstrap percen- 
tile method, which gives [0, .009]. The corresponding inter- 
vals for pr(y =1 |x =1) are [.983, 1] and [.990, 1]. 


5. CONCLUSION 


The presence of measurement error can induce substantial 
bias into standard estimates of transition rates from 
longitudinal data. If external estimates of misclassification 
rates are available then a variety of adjustment methods exist. 
If no such information is available then this paper shows how 
adjustment for measurement error alternatively can be carried 
out using instrumental variable estimation. 

The main problem, as in conventional instrumental 
variable estimation, is finding a variable which one can be 
confident satisfies the conditions required of an instrumental 
variable. Even if the conditions are satisfied then it is 
desirable, in order to obtain reasonable precision, that there be 
a fairly strong association between this variable and the true 
state. If such a variable can be found then instrumental 
variable estimation may be useful. 


ACKNOWLEDGEMENTS 


We are grateful to Wayne Fuller for suggesting the basic 
idea underlying this paper. Research was supported by grant 
number H519 25 5005 from the Economic and Social 
Research Council under its Analysis of Large and Complex 
Datasets programme. 


REFERENCES 


ABOWD, J.M., and ZELLNER, A. (1985). Estimating gross labor 
force flows. Journal of Business and Economic Statistics, 3, 
254-283. 


ANDERSON, T.W. (1959). Some scaling models and estimation 
procedures in the latent class model. Probability and Statistics, 
(Ed. U. Grenander). Stockholm: Wiksell and Almquist. 


60 Humphreys and Skinner: Instrumental Variable Estimation of Gross Flows 


BAKER, S.G., and LAIRD, N.M. (1988). Regression analysis for 
categorical variables with outcome subject to nonignorable 
nonresponse. Journal of the American Statistical Association, 
83, 62-69. 


BARTHOLOMEW, D.J. (1987). Latent Variable Models and Factor 
Analysis. London: Griffin. 


BIEMER, P.P., GROVES, R.M., LYBERG, L.E., MATHIOWETZ, 
N.A., and SUDMAN, S. (1991). Measurement Errors in Surveys. 
New York: Wiley. 


BRITTON, M., and BIRCH, F. (1985). 1981 Census 
Post-Enumeration Survey. London: Her Majesty’s Stationery Office. 


CHUA, T., and FULLER, W.A. (1987). A model for multinomial 
response error applied to labor flows. Journal of the American 
Statistical Association, 82, 46-51. 


DURBIN, J. (1954). Errors in variables. Review of the International 
Statistical Institute, 22, 23-31. 


EDLEFSEN, L.E., and JONES, S.D. (1984). Reference Guide to 
GAUSS. Applied Technical Systems. 


FORSMAN, G., and SCHREINER, I. (1991). The design and 
analysis of reinterview: an overview. In Measurement Errors in 
Surveys. (Eds. Biemer, P.P., Groves, R.M., Lyberg, L.E., 
Mathiowetz, N.A., and Sudman, S.). New York: Wiley. 


FULLER, W.A. (1987). Measurement Error Models. New York: 
Wiley. 


GOODMAN, L.A. (1974). Exploratory latent structure analysis 
using both identifiable and unidentifiable models. Biometrika, 
61, 215-231. 


HILL, M.S. (1992). The Panel Study of Income Dynamics: A User's 
Guide. Newbury Park, CA: Sage. 


HOGUE, C.R., and FLAIM, P.O. (1986). Measuring gross flows in 
the labor force: an overview of a special conference. Journal of 
Business and Economic Statistics, 41, 111-21. 


MADANSKY, A. (1960). Determinantal methods in latent class 
analysis. Psychometrika, 25, 183-198. 


MARQUIS, K.H., and MOORE, J.C. (1990). Measurement errors 
in the Survey of Income and Program Participation (SIPP): 
Program Reports. Proceedings of the 1990 Annual Research 
Conference. US Bureau of the Census, 721-745. 


MEYER, B.D. (1988). Classification-error models and labor-market 
dynamics. Journal of Business and Economic Statistics, 6, 
385-390. 


POTERBA, J.M., and SUMMERS, L.H. (1986). Reporting errors 
and labor market dynamics. Econometrica, 54, 1319-1338. 


REIERSOL, D. (1941). Confluence analysis by means of lag 
moments and other methods of confluence analysis. 
Econometrica, 9, 1-24. 


SINGH, A.C., and RAO, J.N.K. (1995). On the adjustment of gross 
flow estimates for classification error with application to data 
from the Canadian Labour Force Survey. Journal of the 
American Statistical Association, 90, 478-488. 


SKINNER, C.J. (1989). Domain means, regression and multivariate 
analysis. In Analysis of Complex Surveys, (Ch. 3) (Eds. 
Skinner, C.J., Holt, D., and Smith, T.M.F.). Chichester: Wiley. 


SKINNER, C.J., and TORELLI, N. (1993). Measurement error and 
the estimation of gross flows from longitudinal economic data. 
Statistica, 53, 391-405. 


VAN DE POL, F., and DE LEEUW, J. (1986). A latent Markov 
model to correct for measurement error. Sociological Methods 
and Research, 15, 118-141. 


VAN DE POL, F., and LANGEHEINE, R. (1990). Mixed Markov 
latent class models. In Sociological Methodology 1990, (Ed. 
C.C. Clogg). Oxford: Basil Blackwell, 213-247. 


VAN DE POL, F., LANGEHEINE, R., and DE JONG, W. (1991). 
PANMARK User Manual. Panel analysis using Markov chains. 
Version 2.2. Netherlands Central Bureau of Statistics. 


Survey Methodology, June 1997 
Vol. 23, No. 1, pp. 61-71 
Statistics Canada 




Geographic-Based Oversampling in Demographic Surveys 
of the United States 


JOSEPH WAKSBERG, DAVID JUDKINS and JAMES T. MASSEY’ 


ABSTRACT 


Often one of the key objectives of multi-purpose demographic surveys in the U.S. is to produce estimates for small domains 
of the population such as race, ethnicity, and income. Geographic-based oversampling is one of the techniques often 
considered for improving the reliability of the small domain statistics using block or block group information from the 
Bureau of the Census to identify areas where the small domains are concentrated. This paper reviews the issues involved 
in oversampling geographical areas in conjunction with household screening to improve the precision of small domain 
estimates. The results from an empirical evaluation of the variance reduction from geographic-based oversampling are 
given along with an assessment of the robustness of the sampling efficiency over time as information for stratification 
becomes out of date. The simultaneous oversampling of several small domains is also discussed. 


KEY WORDS: Sample design; Stratification; Rare populations. 


1. INTRODUCTION 


The sponsors of many broad multi-purpose demographic 
surveys require separate analyses of domains defined by race, 
ethnicity and income. Equal probability samples generally do 
not provide sufficient sample sizes for some of these domains 
to yield the precision needed, making some form of 
oversampling necessary. This requirement poses interesting 
methodological problems since there is no registry of the U.S. 
population from which samples stratified by these domains 
can be drawn. Housing lists containing identifiers for these 
domains are maintained at the Bureau of the Census, but they 
are not available to researchers outside of the Bureau. For 
surveys requiring face-to-face interviews, outside researchers 
are thus forced to use area sampling techniques. Even within 
the Bureau, geography is sometimes used as the basis of 
oversampling since the lists are only updated once every ten 
years. This paper describes efficient methods for over- 
sampling the aforementioned domains in the context of area 
sampling. 

Data from the U.S. Decennial Census on concentrations of 
various demographic domains are publicly available for small 
geographic units; race and ethnicity are reported for every 
block and income for every block group. (A “block” is an 
area bounded on all sides by roads and not transected by any 
roads. Block groups are combinations of several neigh- 
bouring blocks.) These data may be used to inexpensively 
improve the precision of statistics about rare domains by 
oversampling blocks or block groups that contain higher than 
average concentration of members of rare domains and then 
dropping or subsampling screened persons not in the targeted 
rare domains. The general theory for this type of sample 
design was worked out by Kish (1965, Section 4.5). An 
independent presentation of the theory with examples from 


the 1960 Decennial Census was given by Waksberg (1973). 
Further examples and a discussion of alternative methods are 
given by Kalton and Anderson (1986) and by Kalton writing 
for the United Nations (1993). In this paper, we extend prior 
illustrations to cover more domains, update results to 1990, 
and evaluate empirically the robustness of these methods over 
time. 

We first briefly review the issues involved with screening 
and subsampling persons not in the targeted domains. Then 
we review the theory for optimal allocation where the strata 
are defined in terms of the density of rare populations and 
apply this theory to several rare populations. The main part 
of the paper is an empirical evaluation of the
variance reduction from the geographic oversampling of
various minority and other rare populations as well as how 
robust the variance reductions are over time. We also discuss 
the special problems involved with simultaneous targeting of 
several rare populations before summarizing our conclusions. 


2. SURVEY COST STRUCTURE AND THE 
SCREENING DECISION 


Let U stand for some target universe such as persons or 
households for which a sampling frame exists. Let D stand 
for some small domain of particular interest such as black 
persons that cannot be separately identified from the balance 
of U at the time of sampling. Let Y be a vector of 
characteristics of interest such as annual income, employment 
status, and number of doctors’ visits in the last year. In some 
surveys, the only objective is estimation of the distribution of 
Y on D. In such surveys, members of U-D that are discovered
in the course of screening sampled members of U will be 
dropped from the sample. A general inexpensive interview 
questionnaire is used for the screening to determine who is
eligible for a full questionnaire. 

In other surveys, estimation of the distribution of Y on D 
and on U are both important objectives. For such a survey, at 
least some of the members of U-D that are discovered in the 
course of screening interviews will be retained for full 
interviews. If geographic-based oversampling is used, the 
initial sample will contain an oversample of those members of 
U-D who happen to reside in areas with heavy concentrations 
of D. Even when U-D is of interest, this oversampling of U-D 
in areas with high concentrations of D is usually undesirable 
since resulting variation in probabilities of selection for U-D 
leads to unnecessarily large design effects for statistics both 
about U and about U-D. These larger design effects mean 
that the extra sample size for U-D will usually result in only 
a trivial decrease in variances for statistics about U-D. 
Generally, the funds expended on the extra interviews with 
U-D would be better spent on increasing the total initial 
sample size. 

It is fairly easy to set up subsampling procedures that result 
in an equi-probability sample of U-D. The subsampling can 
be done centrally after the completion of the entire screening 
operation, or it can be done by the interviewer while still in 
the sample household after obtaining data on household 
composition. Techniques have been developed that make the 
subsampling process very easy for the interviewer (Waksberg 
and Mohadjer 1991). Interviewers do not need to be trained 
to carry out random draws. With paper and pencil survey 
instruments, interviewers are given house-by-house pre- 
interview instructions about which domains can be inter- 
viewed at which households. These instructions are 
randomized centrally prior to screening to yield the desired 
sampling rates. Alternatively, with CAPI, the subsampling 
can be programmed and carried out automatically in the 
laptop computer used for CAPI; the computer notifies the 
interviewer which households are to be retained for the full 
interview and which ones to reject as a result of subsampling. 
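
The centrally randomized instructions described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of Waksberg and Mohadjer (1991): it assumes each sampled household belongs to a stratum with a known screening rate, and that a U-D household is retained with probability equal to the lowest screening rate divided by its own stratum's rate, so that the overall selection probability for U-D is the same everywhere. The function name, the stratum labels, the rates and the seed are all hypothetical.

```python
import random

def make_retention_flags(household_strata, screening_rate, seed=12345):
    """Pre-assign, for each sampled household, whether a non-domain (U-D)
    household found there is kept for the full interview.

    household_strata: list of stratum labels, one per sampled household.
    screening_rate: dict mapping stratum label -> screening sampling fraction.
    The retention probability is (lowest rate) / (stratum rate), so that
    screening rate x retention probability is identical in every stratum.
    """
    rng = random.Random(seed)  # fixed seed: the draw is made once, centrally
    target = min(screening_rate.values())
    flags = []
    for stratum in household_strata:
        keep_prob = target / screening_rate[stratum]
        flags.append(rng.random() < keep_prob)
    return flags

# Hypothetical rates: the high-density stratum is screened at 4x the base rate.
rates = {"low": 0.001, "high": 0.004}
flags = make_retention_flags(["low", "high", "high", "low"], rates)
```

With these rates, U-D households in the "low" stratum are always retained and those in the "high" stratum are retained about one time in four; the flags can be printed on the interviewer's assignment sheet or loaded into the CAPI instrument.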

Whether it is better to keep all sampled members of U-D 
or to subsample them depends on the relative sizes of U and 
U-D, the precision requirements for both and on the relative 
costs of full interviews and the shorter screening interviews. 
Let c* be the variable cost associated with sampling a single
member of U and collecting and processing all data of interest
about that member. Let c’ be the variable cost associated with
sampling, screening, and then dropping a single member of U.
Let c = c*/c’ be the ratio of the cost of a full interview to the
cost of a screening interview. If c is much greater than 1, then
subsampling should be considered for the survey that has
interest in U-D even though subsampling of U-D will
introduce some additional complexity into survey operations.
Given that the full interview is by definition longer than the
screening interview, it should always be the case that c is at
least slightly greater than 1. On panel and longitudinal
surveys, the cost of all follow-back interviews should be
counted as part of c*, typically making the cost of a full
interview many times larger than the cost of a screening
interview; i.e., c >> 1. The same will be true of surveys that
involve the collection of physical specimens requiring
expensive laboratory work and of surveys that require 
expensive experts (such as medical doctors) to participate in 
the primary data collection. For such surveys, we would 
highly recommend that geographic-based oversampling not be 
employed by itself, but rather, in conjunction with screening 
and subsampling. For a door-to-door survey with a single 
interview by a standard grade interviewer (trained to ask 
questions and record answers but not to make any technical or 
anthropological assessments), c is frequently in the range of 
3 to 5. This is large enough in many applications to justify 
the complication of subsampling U-D in oversampled areas. 


3. FORMING THE STRATA 


We assume that even though D cannot be separated from 
U at the time of sampling, there is some information available 
about the distribution of D and U across a set of 
geographically defined entities. In the United States, the 
natural entities are blocks or block groups (BGs) and 
information for these entities is supplied by the decennial 
census. (Prior to the 1990 decennial census, blocks were not 
defined in rural areas; larger entities called “enumeration 
districts” were used for oversampling.) The U.S. Bureau of 
the Census makes data on the racial and ethnic composition 
of blocks publicly available along with mapping information 
so that these blocks can be identified years later by any survey 
organization. Income data are only made available at the BG 
level. 

Standard practice calls for the stratification of the blocks 
or BGs by the local concentration of D. Thus, all blocks 
where D constitutes less than 10 percent of the block’s total
population might constitute one stratum. Further cutpoints
for defining the strata might be 30 percent, and 60 percent,
yielding a total of four strata. There has been little empirical
study of the optimal number of strata or of the optimal
cutpoints. In general, more strata will yield more efficient 
designs, but, at some point, the operational complexities of a 
large number of strata outweigh the gains in efficiency. 
Conventional wisdom dating back to Kish (1965) holds that 
a fairly small number of strata will achieve most of the gains 
attainable through stratification. 
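
As a concrete illustration of the stratification step, the sketch below assigns a block (or BG) to a density stratum from its census counts, using the example cutpoints of 10, 30 and 60 percent mentioned above. The function name, the labels and the treatment of unpopulated blocks are assumptions made for the example, not part of the paper.

```python
from bisect import bisect_right

CUTPOINTS = [10.0, 30.0, 60.0]                      # percent of the block in D
LABELS = ["<10%", "10-30%", "30-60%", "60-100%"]    # four density strata

def density_stratum(domain_count, total_count):
    """Return the density stratum label for one block or block group."""
    if total_count <= 0:
        # Assumption: unpopulated blocks are placed in the lowest stratum.
        return LABELS[0]
    pct = 100.0 * domain_count / total_count
    # bisect_right puts a block at exactly 10 percent into "10-30%",
    # consistent with "less than 10 percent" defining the first stratum.
    return LABELS[bisect_right(CUTPOINTS, pct)]

print(density_stratum(45, 100))
```

Changing the number of strata only requires editing the two lists, which makes it easy to experiment with alternative cutpoints.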


4. OPTIMAL ALLOCATION FOR A SINGLE 
DOMAIN 


Our objective is to adapt the general formulas for optimum 
allocation of a stratified sample to apply to the reduction in 
variance due to geographic-based oversampling. The 
derivations are essentially those given by Kish (1965) using 
the notation of Kalton in United Nations (1993). Let the 
population be divided into a number of strata as discussed 
above. Let N be the size of the total population and N, be the 
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size of the total population within the /-th stratum. Let P, be 
the proportion of the A-th stratum that consists of members of 
D. Let P be the overall proportion of the population that 
belongs to D. We may use the prior decennial census to 
estimate P, and P, or we may use some more recent large 
survey that carried block and/or BG codes for every sample 
household/person so that matching to the last decennial 
census will yield the stratum identification for every sample 
household/person. 

We assume that c is constant across the strata even though 
this may sometimes not be very accurate. For example, 
interviewing in blocks with high concentrations of American 
Indians, Eskimos or Aleuts almost always means interviewing 
in remote locations with difficult transportation issues. 
However, estimation of even a national average for c is 
difficult for most survey operations. It will not generally be 
possible to get estimates by stratum. 

We also assume that the distribution of Y on D is constant 
across the strata. More specifically, we assume that 


\[ E(Y \mid D \text{ and } h) = E(Y \mid D) \]

and that

\[ \mathrm{Var}(Y \mid D \text{ and } h) = \mathrm{Var}(Y \mid D), \]


where the expected value and variance are with respect to the 
population, not the sample design. This is usually not a very 
good assumption, but given a vector of characteristics of 
interest, the components of the vector will usually behave 
differently across the strata so there is no point in trying to be 
more exact. Lastly, we assume that the sampling fractions are 
small enough in all the strata to make the finite population 
correction factors ignorable. 

Given these assumptions, the optimal sampling fraction for 
the h-th stratum for a survey where all screened members of 
U-D are dropped is 


\[ f_h = k \sqrt{\frac{P_h}{P_h (c - 1) + 1}} \tag{1} \]


where k is a constant determined by either precision
requirements or budget constraints. (For a proof of (1), see
either of the sources referenced above. This allocation rule is
an application of Neyman allocation.) If c = 1 (i.e.,
screening is as expensive as interviewing), then this
proportionality reduces to $f_h \propto \sqrt{P_h}$, which can yield
allocations quite different from an equi-probability sample
across strata. However, if the cost of screening is far less
than the cost of interviewing (i.e., c >> 1) and D is not
extremely rare (i.e., $P_h$ is not close to zero), then this
relationship results in close to a flat set of sampling intervals,
which is equivalent to allocation in proportion to total
population.
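
The behaviour of the allocation rule at the two extremes of c can be checked numerically. The sketch below takes the rule to be that the sampling fraction in stratum h is proportional to sqrt(P_h / (P_h (c - 1) + 1)) and reports each stratum's rate relative to the lowest-density stratum. The stratum proportions and the function names are hypothetical, chosen only to show the pattern.

```python
import math

def relative_sampling_fractions(p_h, c):
    """Sampling fractions up to the constant k: sqrt(P_h / (P_h (c-1) + 1))."""
    return [math.sqrt(p / (p * (c - 1.0) + 1.0)) for p in p_h]

def oversampling_ratios(p_h, c):
    """Each stratum's sampling fraction relative to the first stratum."""
    rel = relative_sampling_fractions(p_h, c)
    return [r / rel[0] for r in rel]

# Hypothetical proportions of D in four density strata.
p_h = [0.02, 0.20, 0.45, 0.85]
for c in (1, 5, 50):
    print(c, [round(r, 2) for r in oversampling_ratios(p_h, c)])
```

With c = 1 the ratios follow the square roots of the stratum proportions, so the densest stratum is sampled several times more heavily than the sparsest; as c grows, the ratios shrink toward 1 and the design approaches an equi-probability sample.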

Given a fixed budget of B, k is determined by the cost
equation


\[ B = \sum_h N_h f_h c' \left[ c P_h + (1 - P_h) \right] \tag{2} \]


To obtain a simple random sample of size n from domain D
would require selecting a screening sample of size n/P,
resulting in a total cost of


\[ B = n c c' + c' \left( \frac{n}{P} - n \right) \tag{3} \]
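
The two cost expressions can be written out as small functions, which makes it easy to confirm that they agree when there is only one stratum. The sketch assumes a cost of c_full for every member of D who is screened and fully interviewed and a cost of c_screen for every person screened and dropped; the function names and the dollar amounts are hypothetical.

```python
def stratified_cost(pop_h, f_h, p_h, c_full, c_screen):
    """Variable cost of screening N_h * f_h persons in each stratum,
    interviewing the share P_h in D and dropping the rest."""
    return sum(
        n_pop * f * (c_full * p + c_screen * (1.0 - p))
        for n_pop, f, p in zip(pop_h, f_h, p_h)
    )

def srs_cost(n, p, c_full, c_screen):
    """Variable cost of screening n / P persons at a uniform rate to
    obtain n full interviews with members of D."""
    return n * c_full + c_screen * (n / p - n)

# Hypothetical example: 1,000 interviews, D is 10 percent of the population.
print(srs_cost(1000, 0.10, c_full=30.0, c_screen=10.0))
print(stratified_cost([100000], [0.10], [0.10], c_full=30.0, c_screen=10.0))
```

Both calls describe the same design (one stratum of 100,000 persons screened at a rate of 1 in 10), so they return the same cost.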


By equating these two costs, we can solve for the constant of 
proportionality in (1) and get: 


\[ k = \frac{n \left[ c - 1 + \frac{1}{P} \right]}{\sum_h N_h \sqrt{P_h \left[ P_h (c - 1) + 1 \right]}} \tag{4} \]


To calculate the benefits of this allocation realistically, it
is necessary to acknowledge the fact that the estimates of $P_h$
that are used to guide the allocation will be somewhat out of
date by the time that the survey is actually conducted. Let $A_h$
be the proportion of D actually to be found within the h-th
stratum at the time of sampling and data collection. It is
assumed that P is unchanged even though the distribution
across strata changes according to $A_h$. By letting $N P = N_D$ and
$N_D A_h = N_{Dh}$, it can readily be shown that the actual sample
size, $n_D$, that will be achieved on D is given by


\[ n_D = \sum_h N P A_h f_h \tag{5} \]


From Kish (1965), this sample will have higher variance 
than a simple random sample of the same size on D. The 
variance inflation factor or design effect associated with the 
differential sampling rates across strata is the well-known 


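\[ \mathrm{deff} = \left( \sum_h A_h f_h \right) \left( \sum_h \frac{A_h}{f_h} \right) \tag{6} \]


Thus, the effective sample size associated with the
geographic-based oversampling is


\[ \frac{n_D}{\mathrm{deff}} = N P \left( \sum_h \frac{A_h}{f_h} \right)^{-1} \tag{7} \]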
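
A short numerical check may help fix ideas about the design effect from differential sampling rates. The sketch below computes deff as the product of the sum of A_h f_h and the sum of A_h / f_h, and the effective sample size as the domain population divided by the sum of A_h / f_h. The two-stratum figures and the function names are hypothetical.

```python
def design_effect(a_h, f_h):
    """deff = (sum of A_h * f_h) * (sum of A_h / f_h), where A_h is the share
    of the domain in stratum h and f_h is the stratum sampling fraction."""
    return sum(a * f for a, f in zip(a_h, f_h)) * sum(a / f for a, f in zip(a_h, f_h))

def effective_sample_size(domain_pop, a_h, f_h):
    """Achieved domain sample size divided by deff, which simplifies to
    domain_pop / sum(A_h / f_h)."""
    return domain_pop / sum(a / f for a, f in zip(a_h, f_h))

# Hypothetical: half of D in each stratum, second stratum sampled at 4x the rate.
a_h = [0.5, 0.5]
f_h = [0.01, 0.04]
print(design_effect(a_h, f_h))
print(effective_sample_size(10000, a_h, f_h))
```

When all sampling fractions are equal the design effect is exactly 1; the further the rates diverge for strata that both hold a sizeable share of D, the larger the penalty.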


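Substitution of formulae (1) and (4) into (7) yields


\[ \frac{n_D}{\mathrm{deff}} = \frac{n N P \left[ c - 1 + \frac{1}{P} \right]}{\left\{ \sum_h N_h \sqrt{P_h \left[ P_h (c - 1) + 1 \right]} \right\} \left\{ \sum_h A_h \sqrt{\frac{P_h (c - 1) + 1}{P_h}} \right\}} \tag{8} \]


This formula allows us to compare the variance for an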
arbitrary statistic on domain D given geographic-based 
oversampling with the variance for the same statistic given a 
simple random sample of D of the same total cost B. Formula 
(8) can be rewritten algebraically such that the proportion of 
simple random sample variance that is eliminated by the 
geographic-based oversampling is given by 


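\[ 1 - \frac{n \, \mathrm{deff}}{n_D} = 1 - \frac{\left\{ \sum_h N_h \sqrt{P_h \left[ P_h (c - 1) + 1 \right]} \right\} \left\{ \sum_h A_h \sqrt{\frac{P_h (c - 1) + 1}{P_h}} \right\}}{N P \left[ c - 1 + \frac{1}{P} \right]} \tag{9} \]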


It is definitely possible for this reduction to be negative, 
meaning that a simple random sample would have provided 
lower variance for the same cost. This is most likely to 
happen when there exists a stratum for which $N P A_h \gg N_h P_h$,
meaning that there exists a stratum which was thought to have
a very small portion of D but, in fact, has quite a significant
portion of D. Note that if $P_h = P$, then no variance reduction
can be expected from geographic-based oversampling. Also, 
as c goes to infinity for fixed P (equivalent to screening 
becoming cheaper and cheaper relative to full interviews), the 
variance reduction approaches zero. Given the extra 
complication of a stratified sample, this means that for large 
c and moderate P, the sample designer should consider 
drawing a simple random sample instead of a stratified 
sample. Geographic-based oversampling increases in value 
as P approaches zero, c approaches 1, and D becomes more 
concentrated in a single stratum. As the small domain of
interest, D, becomes more concentrated in a single stratum, the
sample becomes more efficient, since fewer cases from D are
left in the remaining strata, where they carry large weights. The
potential reductions in variance due to geographic-based 
oversampling under a number of conditions are shown 
empirically for several demographic domains in the section 
below. 
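
The qualitative statements in this section can be checked with a short calculation. The sketch below is one reading of the derivation: sampling fractions proportional to sqrt(P_h / (P_h (c - 1) + 1)), a budget matched to that of a simple random sample, and a design effect computed from the shares of D actually found in each stratum. The function name and argument layout are assumptions; the illustrative inputs are the 1980 columns of Table 1, with the overall proportion taken from the population totals in that table.

```python
import math

def variance_reduction(w_h, a_alloc, a_actual, p, c):
    """Proportion of SRS variance removed by geographic oversampling at
    equal cost (a negative value means the SRS would have done better).

    w_h:      share of the total population in each stratum (allocation data)
    a_alloc:  share of D in each stratum according to the allocation data
    a_actual: share of D in each stratum at the time of the survey
    p:        overall proportion of the population in D
    c:        ratio of full-interview cost to screening cost
    """
    p_h = [p * a / w for a, w in zip(a_alloc, w_h)]       # stratum densities
    g_h = [ph * (c - 1.0) + 1.0 for ph in p_h]
    cost_term = sum(w * math.sqrt(ph * gh) for w, ph, gh in zip(w_h, p_h, g_h))
    weight_term = sum(a * math.sqrt(gh / ph) for a, ph, gh in zip(a_actual, p_h, g_h))
    return 1.0 - cost_term * weight_term / (p * (c - 1.0) + 1.0)

w_1980 = [0.782, 0.089, 0.051, 0.078]   # total population by stratum, 1980
a_1980 = [0.097, 0.135, 0.189, 0.579]   # black population by stratum, 1980
p_1980 = 26495 / 226546                 # blacks as a share of the nation, 1980

for c in (2, 5, 20, 50):
    print(c, round(variance_reduction(w_1980, a_1980, a_1980, p_1980, c), 3))
```

Under a static distribution the reduction at c = 5 comes out at roughly a quarter of the simple random sample variance, and it shrinks steadily as c increases. Setting every stratum density equal to the overall proportion gives a reduction of zero.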


5. EMPIRICAL EVALUATION 


Equation (9) is quite difficult to evaluate for domains of 
interest. Data on $P_h$ can be obtained from summary tapes
from the decennial censuses that are published at the block,
block group, and enumeration district levels by the Bureau of
the Census. This allows one to define reasonable strata and
to evaluate equations (1) through (4). If one were to assume
that the $P_h$ are static over time, then the rest of the equations
could also be evaluated. However, Americans tend to move 
frequently, and the racial and ethnic composition of many
blocks changes in that process (Judkins, Massey and Waksberg
1992). To the extent that members of D move into areas
where they were previously not common, the benefits of the
geographic-based oversampling diminish. Not wishing to
overstate the benefits of the procedure, we searched for some
method to get reasonable estimates of the $A_h$ at postcensal time
points. Matching block- or BG-level data for two consecutive
censuses might appear to be a good solution but is not
possible. Up to now, blocks have been defined and labelled
independently from census to census with no attempt to
preserve definitions for longitudinal analysis. Thus, alternate
information sources are required to estimate $A_h$.

For the analysis of the benefits of geographic-based 
oversampling for the black and Hispanic populations, micro- 
level data from current household surveys conducted by the 
Census Bureau turned out to be a good source of information 
on the $A_h$. Specifically, we used data from the 1988 National
Health Interview Survey (NHIS). Staff at the Census Bureau 
prepared a special tape for us that gave the 1980 block group 
or enumeration district code for almost all households 
interviewed in the 1988 NHIS in residences built prior to 
1980. (Residences constructed during the 1980s would have 
been sampled for the NHIS from building permits rather than 
by area sampling. Due to technical difficulties, block and 
block group labels are not attached to such sample dwellings.) 
We then matched the 1988 NHIS against 1980 Census 
summary files by block group or enumeration district in order 
to classify NHIS households into strata defined by 
concentrations of blacks and Hispanics in 1980. Using 
survey weights, we were then able to estimate the distribution 
of various domains across those strata. (Housing built during 
the 1980s was assumed to be in the stratum with the lowest 
concentration of the rare domains.) Similar operations could 
have been carried out for Asians, Pacific Islanders, American 
Indians, Eskimos, Aleuts, and persons with low income but 
were not. 
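
The estimation step described above, in which survey records already matched to census density strata are weighted up to give the distribution of a domain across strata, might look like the following in outline. The record layout, the function name and the weights are hypothetical; the actual NHIS file structure is not described in this paper.

```python
from collections import defaultdict

def weighted_stratum_shares(records):
    """Estimate the share of a domain found in each density stratum.

    records: iterable of (stratum_label, survey_weight, in_domain) tuples,
             one per sampled person, with the stratum taken from the
             census data for the person's block group.
    """
    totals = defaultdict(float)
    for stratum, weight, in_domain in records:
        if in_domain:
            totals[stratum] += weight
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("no weighted domain members in the records")
    return {stratum: t / grand_total for stratum, t in totals.items()}

# Hypothetical records: (density stratum, weight, member of the domain?)
sample = [
    ("<10%", 2000.0, True),
    ("60-100%", 1500.0, True),
    ("<10%", 3000.0, False),
    ("60-100%", 500.0, True),
]
print(weighted_stratum_shares(sample))
```

Persons outside the domain contribute nothing to the shares, so in this small example each of the two strata holds half of the weighted domain total.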

Tables and charts in the balance of the paper will refer to 
data at several points in time and from several sources. It is 
useful to bear in mind that the data used to form the strata do 
not have to be the same as the data used to allocate the 
sample, and that the data used to evaluate the sample may be 
from a third point in time or source. We have the following 
combinations in this paper: 

Label           Source of                    Source of          Source of
                stratification data          allocation data    evaluation data

80/80/80 BG     1980 Census (BG level)       1980 Census        1980 Census
80/80/88 BG     1980 Census (BG level)       1980 Census        1988 NHIS
80/88/88 BG     1980 Census (BG level)       1988 NHIS          1988 NHIS
90/90/90 BG     1990 Census (BG level)       1990 Census        1990 Census
90/90/90 blk    1990 Census (block level)    1990 Census        1990 Census

Table 1
Residential Clustering of Blacks

Density stratum: blacks as a percent of the stratification unit in
the year of stratification.

Percentage of blacks living in the stratum in the indicated year

Measurement year              1980         1988         1990         1990
Stratification year           1980         1980         1990         1990
Stratification unit           BG/ED        BG/ED        BG           Block
< 10%                         9.7          20.5         12.0         8.5
10-30%                        13.5         [illegible]  16.8         13.9
30-60%                        18.9         20.4         20.3         16.2
60-100%                       57.9         45.9         51.0         61.4
Total populations (1000s)     26,495       29,380       29,986       29,986
Blacks as percent of nation
in measurement year           11.7         12.0         12.1         12.1

Percentage of the total population living in the stratum in the
indicated year

Measurement year              1980         1988         1990         1990
Stratification year           1980         1980         1990         1990
Stratification unit           BG/ED        BG/ED        BG           Block
< 10%                         78.2         81.4         [illegible]  [illegible]
10-30%                        8.9          [illegible]  11.4         9.6
30-60%                        5.1          5.1          5.7          4.5
60-100%                       7.8          6.4          [illegible]  8.4
Total populations (1000s)     226,546      240,876      248,710      248,710

Sources: 1980 Decennial Census (Westat tabulation)
1988 National Health Interview Survey (Westat tabulation)
1990 Decennial Census (Westat tabulation)


6. OVERSAMPLING THE BLACK POPULATION 


Table 1 shows various aspects of residential segregation
for the black population in the U.S. that are important to know
about when designing a population survey. Although the
percentage of blacks living in densely black (60+ percent)
block groups declined between 1980 and 1990, it is clear that
blacks were still strongly segregated. The columns about the
population in 1988 are particularly important since they show
the dynamics of the stratification data over time. By 1988, the
percentage of the black population living in the block groups
that were less than 10 percent black in 1980 had doubled,
from just 9.7 percent of blacks to 20.5 percent. This has
major implications for the efficacy of geographic-based
oversampling as will be shown below. It is also interesting to
note that the total population in the block groups that were
densely black (i.e., over 60% black) in 1980 actually declined
by about 2 million persons between 1980 and 1988. At least
part of this shift came from abandonment of some old housing
and neighbourhoods. Concentration levels are sharper at the
block level than at the block group level in 1990, as would be
expected. (Block level data are not available for the whole
nation from 1980.) Although sampling blocks is slightly
more costly than sampling block groups (due to the larger
number of blocks and the need to make provisions for blocks
that have fewer inhabitants than the desired sample cluster
size), it does allow sharper focus on the targeted domain.

Figure 1. Variance Reduction from Geographic-based Oversampling
for Blacks (vertical axis: Variance Reduction Relative to SRS of
Same Cost)

Figure 1 summarizes the implications of the density data 
shown in Table 1 for oversampling blacks. This figure shows 
the substantial effect of c on the efficiency of geographic- 
based oversampling. For values of c beyond 20, the best way 
to sample the black population is probably just to screen an 
equi-probability sample. 

The figure also illustrates the danger of relying upon the 
stratification data to evaluate the benefits of geographic-based 
oversampling. The 80/80/80 line shows the variance 
reductions that could be made if there were no change over 
time in the distribution of the black population across the 
density strata defined in terms of 1980 block group data. The 
80/80/88 line shows the actual variance reductions that are 
possible in 1988 for the same strata and allocation. At c =5, 
the variance reduction given a static distribution is 26 percent, 
while the variance reduction given observed changes in the 
distribution is just 16 percent. We examined whether 
allocating the sample across the old strata according to new 
distribution data could improve the actual variance reduction 
in 1988. The answer is yes, but not by much. The 80/88/88
line shows the variance reductions that are possible using the 1988
distribution across the 1980 strata to guide the allocation for
a survey conducted in 1988. At c = 5, the variance reduction
given this allocation is 18 percent, a very modest improvement
over the 16 percent variance reduction possible with
the allocation guided by the old distribution. This led us to
conclude that the major problem was the old stratification 
itself. By 1988, the extent of migration by the black 
population from block groups that were densely black in 1980 
into block groups that had lower concentrations of black 
populations in 1980 was so great as to cut the variance 
reduction achievable through oversampling almost in half. 
The shift of the black population into block groups with lower 
concentrations of blacks in 1980 results in more sample 
blacks with large weights thus increasing the variability 
among weights which increases the variance. Nonetheless, 
the variance reductions indicated by the 80/80/88 line for 
c < 10 are certainly large enough to be useful. 

Turning attention to the 1990 data in Figure 1, we observe 
that the 90/90/90 BG line is consistently several points below 
the 80/80/80 line, indicating that geographic oversampling at 
the block group level is likely to be slightly less useful during 
the 1990s than it was during the 1980s. This is a reflection of 
the slight reduction in segregation of the American black 
population in 1990 compared to 1980 noted above. On the 
other hand, the 90/90/90 blk line is almost exactly the same as
the 80/80/80 line, indicating that the geographic oversampling 
at the block level can be expected to be as effective during the 
1990s as it was at the block group level in the 1980s. 
Although data have not yet been collected on the distribution 
of the black population in the late 1990s across 1990 density 
strata, we would expect that migration has continued and that 
therefore the gains indicated by the 1990 lines should 
probably be reduced (along the general trend indicated by the 
80/80/88 line) when projecting savings into the late 1990s and 
the first few years after 2000. 


7. OVERSAMPLING HISPANICS 


Table 2 shows various aspects of residential segregation
for Hispanics in the U.S. that are important to know about
when designing a population survey. Several points are
interesting to note. First, it appears that Hispanics (unlike
blacks) became slightly more segregated between 1980 and
1990. Other patterns, however, are similar for the black and
Hispanic populations. In 1980, 30 percent of the Hispanic
population lived in block groups that were 60 percent or more
Hispanic. By 1988 these same block groups contained only
about 21 percent of the Hispanic population. In contrast, the
percent of the Hispanic population living in the 1980 block
groups that were less than 5 percent Hispanic increased from
15 percent in 1980 to 29 percent in 1988. These changes
reflect both a shift of the Hispanic population between areas
and the increase in the Hispanic population coming into the
United States. The restratification of the Hispanic population
using 1990 data shows patterns similar to the 1980 distribution
patterns.

Figure 2. Variance Reduction from Geographic-based Oversampling
for Hispanics


Table 2
Residential Clustering of Hispanics

Density stratum: Hispanics as a percent of the stratification unit
in the year of stratification.

Percentage of Hispanics living in the stratum in the indicated year

Measurement year              1980         1988         1990         1990
Stratification year           1980         1980         1990         1990
Stratification unit           BG/ED        BG/ED        BG           Block
<5%                           14.8         29.3         10.6         6.6
5-10%                         9.6          [illegible]  8.7          8.1
10-30%                        22.6         [illegible]  [illegible]  [illegible]
30-60%                        [illegible]  18.8         24.1         [illegible]
60-100%                       30.0         [illegible]  [illegible]  39.8
Total populations (1000s)     14,609       19,393       [illegible]  [illegible]
Hispanics as percent of
nation in measurement year    6.4          8.1          9.0          9.0

Percentage of the total population living in the stratum in the
indicated year

Measurement year              1980         1988         1990         1990
Stratification year           1980         1980         1990         1990
Stratification unit           BG/ED        BG/ED        BG           Block
<5%                           76.8         79.8         68.4         68.9
5-10%                         8.8          [illegible]  10.9         10.3
10-30%                        8.5          7.4          11.8         11.5
30-60%                        3.5          3.0          [illegible]  4.9
60-100%                       2.4          2.0          3.8          4.4
Total populations (1000s)     226,546      240,876      248,710      248,710

Sources: 1980 Decennial Census (Westat tabulation)
1988 National Health Interview Survey (Westat tabulation)
1990 Decennial Census (Westat tabulation)

Figure 2 summarizes the implications of these segregation 
data on oversampling schemes. The curves show the same 
general patterns as the black curves. Geographic-based 
oversampling appears to be a useful tool for values of c < 10. 
Again though, it is important to be mindful of the effect of 
migration on the variance reduction. The gap between the 
80/80/80 and 80/80/88 lines is greater for Hispanics than for 
blacks, particularly for c < 5. At present, we do not have a
good basis for predicting whether this will be as true in the 
1990s as it was in the 1980s. 


8. OVERSAMPLING OTHER RACIAL
MINORITIES


Tables 3 and 4 show segregation data for Asians and
Pacific Islanders and for American Indians, Eskimos and
Aleuts, respectively. Figures 3 and 4 show corresponding
implications for oversampling these domains. Data from
1980 and 1988 were not tabulated for this work because the
1990 data are not encouraging for the inexpensive
oversampling of these populations even with the use of
stratification by density. The percent reductions in variance
are quite large, greater than those for the black and Hispanic
populations, since the amount of screening that would
otherwise be required is much larger. However, the rarity of
these populations in the U.S. means that very large screening
samples are still required in order to get respectable
interviewed sample sizes. For example, with a cost ratio of 3,
even with geographic-based oversampling, it is necessary to
screen 61,000 persons (or about 24,000 households) in order
to obtain a sample of American Indians, Eskimos and Aleuts
with precision equal to a (theoretical) simple random sample
of 1,000 persons from this domain. (Of course, to successfully
screen 24,000 households, more housing units would
have to be selected to allow for vacants and nonresponse.)
The comparable number for Asians and Pacific Islanders is
18,000 persons or roughly 7,000 households.


Table 3
Residential Clustering of Asians and Pacific Islanders

Density stratum: Asians and Pacific Islanders as a percent of the
1990 block or block group in 1990.

                              Percentage of Asians and    Percentage of the total
                              Pacific Islanders living    population living in
                              in the stratum in 1990      the stratum in 1990
Stratification unit           BG           Block          BG           Block
<5%                           30.5         19.4           86.4         85.2
5-10%                         [illegible]  17.7           [illegible]  7.4
10-30%                        27.8         [illegible]    5.0          [illegible]
30-60%                        14.6         18.0           1.0          [illegible]
60-100%                       9.8          13.0           0.4          0.5
Total population (1000s)      6,968        6,968          248,710      248,710
Asians and Pacific Islanders
as percent of nation in
measurement year              2.8          2.8

Sources: 1990 Decennial Census (Westat tabulation)


Table 4
Residential Clustering of American Indians, Eskimos and Aleuts

Density stratum: American Indians, Eskimos and Aleuts as a percent
of the 1990 block or block group in 1990.

                              Percentage of American      Percentage of the total
                              Indians, Eskimos and        population living in
                              Aleuts living in the        the stratum in 1990
                              stratum in 1990
Stratification unit           BG           Block          BG           Block
<5%                           50.3         34.6           98.3         97.4
5-10%                         7.4          12.1           0.8          1.4
10-30%                        12.4         15.9           0.6          0.8
30-60%                        6.0          [illegible]    0.1          0.1
60-100%                       23.8         29.6           0.2          0.2
Total population (1000s)      1,793        1,793          248,710      248,710
American Indians, Eskimos
and Aleuts as percent of
nation in measurement year    0.7          0.7

Sources: 1990 Decennial Census (Westat tabulation)
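
To put the screening workloads quoted in this section in perspective, it may help to compute how many persons an equal-probability screening sample would need to contain in order to yield 1,000 domain members, using the n/P rule from Section 4 and the population totals (in thousands) reported in Tables 3 and 4. The function name is an assumption, and the calculation ignores vacancies and nonresponse.

```python
def srs_screening_size(n, domain_pop, total_pop):
    """Persons to screen at a uniform rate to find n members of a domain
    that makes up domain_pop / total_pop of the population."""
    return n * total_pop / domain_pop

# Population totals in thousands, as reported for 1990.
aiea = srs_screening_size(1000, domain_pop=1793, total_pop=248710)
api = srs_screening_size(1000, domain_pop=6968, total_pop=248710)
print(round(aiea), round(api))
```

The equal-probability requirement is about 139,000 persons for American Indians, Eskimos and Aleuts and about 36,000 persons for Asians and Pacific Islanders, against the 61,000 and 18,000 quoted for geographic-based oversampling at a cost ratio of 3. The stratified design therefore roughly halves the screening workload, but the absolute numbers remain large.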
Figure 3. Variance Reduction from Geographic-based Oversampling
for Asians and Pacific Islanders


Figure 4. Variance Reduction from Geographic-based Oversampling
for American Indians, Eskimos and Aleuts (vertical axis: Variance
Reduction Relative to SRS of Same Cost; horizontal axis: Ratio of
the Cost of 1 Full Interview to 1 Screener, from 0 to 40; lines:
90/90/90 blk and 90/90/90 BG, labelled by Stratification Year/
Allocation Year/Evaluation Year/Stratification Unit)


9. OVERSAMPLING THE POOR 


Table 5 shows the 1990 distribution of the low income 
population by block groups classified according to the 
proportion of low-income population in the BG. The BGs in
each of the classes depend on the definition of low income.
The figures shown in the table are the percentages of low- 
income persons in each class. Table 5 shows a rather flat 
distribution of low income among the classes for all three 
definitions in 1990. Data (not shown) from the 1970 
decennial census and the Current Population Survey indicate 
that segregation of persons below the poverty level increased 
between 1970 and 1990 (Waksberg 1995), but the segregation 
is still far less than the segregation of racial and ethnic 
groups. The concentrations are somewhat greater for persons 
under 150 percent than for the other two definitions but, 
even for this group, it is considerably less than for racial and 
ethnic groups. As can be seen, with this definition, only 
about 25 percent of the poor live in BGs where 50 percent or 
more of the population is poor. The comparable percentages 
are 19 percent for persons below 125 percent of poverty and 
only 13 percent for persons below 100 percent of poverty. 
Such distributions imply that oversampling households in the 
strata with relatively high percentages of low-income persons 
will not be much better than oversampling and screening the 
entire sampling frame unless the full interview costs are only 
slightly higher than screening costs. 

Figure 5. Variance Reduction from Geographic-based Oversampling 
for Persons with Low Income 


Table 5 
Residential Clustering of the Low Income Population 

Percentage of persons with low income living in the stratum in 1990 

Density stratum (persons with low 
income as a percent of 1990 block 
group in 1990 according to various          Low income definition: 
definitions of low income)                  < Poverty     < 125% of Poverty 

<5%                                            5.8          [illegible] 
5-10%                                         12.3           8.3 
10-20%                                        24.8          21.0 
20-30%                                        19.8          20.2 
30-40%                                        14.3          [illegible] 
40-50%                                        10.0          12.2 
50-100%                                       13.0          19.3 
Total populations (1000s)                   31,797        42,316 
Persons with low income as percent 
of nation in measurement year                 12.8          17.0 

Sources: 1990 Decennial Census (Westat tabulation of STF-3) 


Figure 5 shows the ratio of the variance of the optimum 
sample to an SRS at the same cost, for statistics relating to the 
low-income populations. Interestingly, despite the greater 
concentration associated with the broadest definition of low 
income, the reduction in variance for geographic-based 
oversampling is strongest for the narrowest definition because 
it requires more screening and thus has more to gain from a 
sampling strategy that reduces screening. For all three 
definitions, there appear to be moderate advantages to 
oversampling when c is under 3 or 4, about a 10 or 15 percent 
reduction in variances. When c is as large as 10, the gains are 
very slight, and there is virtually no advantage to 
oversampling BGs with high levels of poverty when c is 20 or 
larger. Of course, migration must be taken into account here 
as well, but we did not obtain the necessary data. Due to the 
effects of migration, the actual variance reductions will 
almost certainly be smaller than those shown in the chart. 
Furthermore, the income data in the 1990 Census are based on 
a one-sixth sample. The sample size in a typical block group 
was a little under 100 households. The classification of 
blocks according to percentage of low-income persons 
therefore has a fair amount of fuzziness to it, and many block 
groups will not be in the categories that Census data assign 
them, but in neighbouring classes, further weakening the 
variance reductions that can be achieved with geographic- 
based oversampling. As a result of these factors, it is unlikely 
that geographic-based oversampling will improve the 
efficiency. In fact, by mid-decade or later, it may actually 
result in an increase in variance. A related unpublished study 
by Waksberg in 1989 showed similar results when 
considering the possibility of merging ZIP-code level 
summary income data onto banks of telephone numbers used 
in RDD sampling. The gains achievable through stratification 
appear quite limited. 

Table 5 (continued) 

Columns, left to right: persons with low income living in the stratum in 1990, < 150% of Poverty; then three columns headed "living in the stratum in 1990" for < Poverty, < 125% of Poverty and < 150% of Poverty. Rows are in the order extracted. 

 1.8          [illegible]   22.4   15.4 
[illegible]   [illegible]   19.7   16.7 
16.8          22.8          25.2   24.8 
[illegible]   10.7          14.4   16.8 
[illegible]   [illegible]    4.8    6.7 


An examination of more detailed tables (not shown) 
indicates that the effectiveness is about the same for various 
types of geographic breakdowns, e.g., states, large or small 
MSAs, central cities, suburban areas, and nonmetropolitan 
areas. Conclusions drawn from this analysis will thus 
approximately apply to subnational surveys. 
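
The way the gains from oversampling shrink as the cost ratio c grows can be explored numerically. The sketch below is not the authors' computation. It assumes a simple cost model in which every sampled household costs one screener unit and every eligible household costs c units in total, equal within-stratum variances, and negligible finite population corrections. Under those assumptions the optimum sampling fraction in stratum h is proportional to sqrt(P_h / (1 + (c - 1) P_h)), where P_h is the prevalence of the domain in the stratum. The function name and the three-stratum shares are hypothetical and are not taken from Table 5.

```python
import math


def variance_ratio(domain_share, pop_share, prevalence, c):
    """Variance of the optimum allocation divided by the SRS variance at equal cost.

    domain_share[h]: share of the target domain living in stratum h (sums to 1)
    pop_share[h]:    share of the total population living in stratum h (sums to 1)
    prevalence:      overall proportion of the population in the target domain
    c:               cost of one full interview relative to one screener (assumed model)
    """
    total = 0.0
    for a, w in zip(domain_share, pop_share):
        p_h = prevalence * a / w  # prevalence of the domain within stratum h
        total += math.sqrt(a * w * (1.0 + (c - 1.0) * p_h))
    return total ** 2 / (1.0 + (c - 1.0) * prevalence)


# Hypothetical strata: half of the domain lives in a stratum holding 10% of the population.
POP_SHARE = [0.6, 0.3, 0.1]
DOMAIN_SHARE = [0.2, 0.3, 0.5]
PREVALENCE = 0.13

for c in (1, 3, 10, 20):
    ratio = variance_ratio(DOMAIN_SHARE, POP_SHARE, PREVALENCE, c)
    print(f"c = {c:2d}: variance reduction = {100 * (1 - ratio):.1f}%")
```

With these made-up shares the reduction falls steadily as c increases, which is the qualitative pattern described for the low-income domains. When the domain is spread evenly across strata the ratio is exactly 1, so nothing is gained.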

However, geographic-based oversampling is an extremely 
effective tool for the low-income black and Hispanic 
populations. As shown in Table 6, blacks and Hispanics 
living in poverty are highly concentrated and others living in 
poverty are not. The left-hand side of Table 6 indicates the 
distribution of the poor black, Hispanic, and other popula- 
tions across density strata defined in terms of poverty rates 
specific to the domain of interest. Interpreting one example 
from the left side, 32 percent of poor Hispanics lived in 1990 
in block groups where the poverty rate for Hispanics was over 
50 percent. The right hand side indicates the distribution of 
the poor black and Hispanic populations across density strata 
defined just in terms of the local concentrations of blacks or 
Hispanics without regard to income levels. Interpreting one 
example from the right side, 44.8 percent of poor Hispanics 
lived in 1990 in block groups where Hispanics constituted 
over 60 percent of the local population. From these numbers, 
we infer that over 90 percent of both poor blacks and poor 
Hispanics live in areas with above average concentrations of 
their respective racial/ethnic groups. This means that a 
sampling strategy that oversamples blocks with high black or 
Hispanic concentrations will automatically yield 
disproportionately large numbers of poor blacks and 
Hispanics. Furthermore, almost no poor blacks or poor 
Hispanics live in areas with low poverty rates for their groups. 
This stands in marked contrast to the patterns for poor people 
who are neither black nor Hispanic. It appears that many poor 
nonhispanic whites live in close proximity to more well-off 
whites, possibly because poverty tends to be a transitory 
phenomenon for them, or perhaps because they are retired and 
purchased their homes when they were in better 
circumstances. 
Table 6 
Residential Clustering of the Low Income Population by Race and Ethnicity 

Percentage of persons with the indicated race/ethnicity and income 
below the poverty line living in the stratum in 1990 

Density stratum 
(Poverty rate in 1990 for 
persons of the indicated 
race/ethnicity within the                    Domain 
block group in 1990)             Blacks        Hispanics   Others 

<5%                                0.6            0.6       10.4 
5-10%                              2.2            2.4       19.6 
10-20%                             8.8           11.0       32.6 
20-30%                            13.8           17.0       18.1 
30-40%                            17.0           19.3        9.0 
40-50%                         [illegible]       17.7        4.6 
50-100%                           40.4           32.0        5.6 
Total populations (1000s)        8,557          5,536     17,975 

Density stratum 
(Indicated minority as a 
percent of 1990 block                        Domain 
in 1990)                         Blacks        Hispanics   Others 

<5%                                4.0            4.6        n/a 
5-10%                          [illegible]    [illegible]    n/a 
10-30%                         [illegible]       19.9        n/a 
30-60%                            19.0        [illegible]    n/a 
60-100%                           60.0           44.8        n/a 
Total populations (1000s)        8,557          5,536     17,975 

Sources: 1990 Decennial Census (Westat tabulation of STF-3) 


10. SIMULTANEOUS OVERSAMPLING 
OF SEVERAL 
RACE-ETHNIC DOMAINS 


In general, geographic-based oversampling can be used as 
easily and effectively for targeting multiple race-ethnic 
domains as for a single race-ethnic domain. In fact, the 
optimal sampling rates for the strata with high concentrations 
of each of the targeted domains will be about the same as if 
only it were being targeted. However, the overall level of 
screening will be increased since the number of areas with 
high sampling rates will increase with the number of targeted 
domains. Both these observations are due to the limited 
overlap between the highly segregated areas of the examined 
racial and ethnic minorities. 


Table 7 presents some data on this subject from the 1990 
Decennial Census. The only domains that overlap signifi- 
cantly in their concentrated areas are Hispanics and Asians 
and Pacific Islanders, and even that overlap only works one 
way. Since there are so many more Hispanics in the U.S. than 
Asians and Pacific Islanders, the proportion of Hispanics that 
live in blocks with Asian/Pacific Islander populations over 
10 percent of the local population is only 13.7 percent while 
the percent of Asians and Pacific Islanders that live in blocks 
with Hispanic populations over 10 percent of the local 
population is a high 40.8 percent. The practical significance 
of this particular overlap is probably slight, however, since it 
would take such a large screening sample (both in and out of 
highly concentrated areas) to find enough Asians and Pacific 
Islanders to meet moderate precision requirements that such 
a screening sample would probably find enough Hispanics 
without resorting to disproportionate allocation of the sample 
to blocks with higher concentrations of Hispanics. 


Table 7 
Residential Mixing of Minorities 

Percentage of each domain living in the stratum in 1990. Strata are defined by the indicated (stratification) minority as a percent of 1990 block in 1990. 

                                                                          Density stratum 
Domain                    Stratification domain                   < 10%     10-30%        30-60%        60-100% 

Blacks                    Hispanic                                 79.2     [illegible]    5.8           2.2 
Blacks                    Asian and Pacific Islander               95.4      3.8           0.7           0.1 
Blacks                    American Indian, Eskimo and Aleut        99.6      0.3           0.0           0.0 
Hispanics                 Black                                    73.4     [illegible]    7.4           3.6 
Hispanics                 Asian and Pacific Islander               86.3     10.7          [illegible]    0.5 
Hispanics                 American Indian, Eskimo and Aleut     [illegible] [illegible]   [illegible]   [illegible] 
Asians and Pacific        Black                                    78.9     15.2           4.2           1.6 
  Islanders               Hispanic                                 59.2     26.9          10.8          [illegible] 
                          American Indian, Eskimo and Aleut        99.6      0.4           0.0           0.0 
American Indians,         Black                                    85.9      8.2           3.3          [illegible] 
  Eskimos and Aleuts      Hispanic                                 81.4     12.3           4.5           1.8 
                          Asian and Pacific Islander               95.1      3.9           0.8           0.2 

Sources: 1990 Decennial Census (Westat tabulation) 


11. CONCLUSIONS 


For household surveys in the U.S., geographic-based 
oversampling using data from the most recent decennial 
census is a useful sampling strategy for improving the 
precision of statistics about the black and Hispanic 
populations provided that the cost of full interviews is less 
than 5 to 10 times the cost of screener interviews. It is also a 
useful strategy for improving the precision of statistics about 
the Asian/Pacific Islander and American Indian/Eskimo/Aleut 
populations, even at very high ratios of the cost of full 
interviews to the cost of screener interviews. 

However, this does not mean that a survey of reasonable 
cost can be designed to simultaneously provide highly precise 
statistics about all these domains while maintaining desired 
precision levels for the total population. Most demographic 
surveys require reasonable precision for both targeted 
domains and for the total population. Shifting some portion 
of the full interviews from the white nonhispanic population 
to the other domains is bound to decrease the precision of 
statistics about the total population. It is generally useful to 
strike a balance between precision attained for subpopulations 
and the total population. The point of this observation is 
merely that geographic-based oversampling does not obviate 
the need to select very large samples and conduct many 
screening interviews when trying to obtain precise statistics 
about rare domains at the lowest possible cost. Furthermore, 
precise statistics about rare domains will continue to be 
expensive even when using geographic-based oversampling. 

For surveys of low-income persons, only small gains are 
possible with geographic-based oversampling, and those only 
when the cost of a full interview is only a few times larger 
than the cost of screening and dropping a household. Most of 
these gains are likely to disappear when deterioration over 
time is taken into account. In fact, by the middle of a decade 
or later, when Census data become seriously outdated, there 
is the distinct possibility that geographic-based oversampling 
could reduce efficiency rather than improve it because of 
migration of the poor and sampling error in measuring 
poverty at the block group level. Geographic-based 
oversampling is a useful tool, however, when the focus of 
interest is on the black or Hispanic poor. 
A Modified Random Groups Standard Error Estimator 


WILLARD C. LOSINGER' 


ABSTRACT 


The standard error estimation method used for sample data in the U.S. Decennial Census from 1970 through 1990 yielded 
irregular results. For example, the method gave different standard error estimates for the “yes” and “no” response for the 
same binomial variable, when both standard error estimates should have been the same. If most respondents answered a 
binomial variable one way and a few answered the other way, the standard error estimate was much higher for the response 
with the most respondents. In addition, when 100 percent of respondents answered a question the same way, the standard 
error of this estimate was not zero, but was still quite high. Reporting average design effects which were weighted by the 
number of respondents that reported particular characteristics magnified the problem. An alternative to the random groups 
standard error estimate used in the U.S. census is suggested here. 


KEY WORDS: Census; Variance estimation; Random groups; Design effect. 


1. INTRODUCTION 


During the 1990 Decennial Census, all respondents were 
asked to provide information on certain data items (called 
100-percent data). Most respondents provided this 
information on the census short form. In addition, a 
systematic sample (ranging from one-eighth to one-half, but 
averaging about one-sixth) of respondents provided 
information for more data items (sample data) on the census 
long form. 

Rather than providing standard error estimates for each 
published sample data estimate, the Census Bureau published 
tables of generalized design effects. For any sample data 
estimate, data users were instructed to create a standard error 
assuming simple random sampling (either using the standard 
formula or from a table) and a one-in-six sampling rate. 
Then, data users were to multiply this standard error by a 
generalized design effect (provided in another table). The 
table of generalized design effects listed design effects by 
data item type and percent of persons or housing units 
included in the sample (Table 1 provides the design effects 
published for 1990 U.S. census sample data for Vermont). 
For example, for all published sample estimates that dealt 
with occupation, a data user would find four generalized 
design effects for occupation: one for each of four sampling 
rate categories for persons in the report. To estimate the 
standard error for the number of teachers in a published 
report, a data user would multiply the simple-random- 
sampling standard error (assuming a one-in-six sampling rate, 
derived from the formula or table of standard errors) by the 
design effect for occupation data items for the reported 
sampling rate. The data user could then use the estimated 
number of teachers and standard error to construct a 
confidence interval. More details on the use of the table of 
design effects are available in the Accuracy of the Data 


section for any sample data product (U.S. Bureau of the 
Census 1993, for example). 
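
The procedure a data user was asked to follow can be sketched in a few lines. The example below is illustrative only: the estimate, the base population, the design effect of 1.2 and the 90 percent multiplier z = 1.645 are all hypothetical inputs, and the function names are not from any Census Bureau product. The simple-random-sampling formula used is the one-in-six form sqrt(5 Y (1 - Y/N)) for an estimated total Y in an area of N persons.

```python
import math


def srs_standard_error(estimate, base):
    """Standard error of an estimated total under SRS with a 1-in-6 sampling rate.

    N^2 (1 - 1/6) p (1 - p) / (N / 6) simplifies to 5 Y (1 - Y / N) with Y = N p.
    """
    return math.sqrt(5.0 * estimate * (1.0 - estimate / base))


def confidence_interval(estimate, base, design_effect, z=1.645):
    """Interval from the SRS standard error scaled by a generalized design effect."""
    se = design_effect * srs_standard_error(estimate, base)
    return estimate - z * se, estimate + z * se


# Hypothetical report: 1,200 teachers estimated in an area of 50,000 persons.
teachers, persons = 1200, 50000
lower, upper = confidence_interval(teachers, persons, design_effect=1.2)
print(f"SRS standard error: {srs_standard_error(teachers, persons):.1f}")
print(f"Interval: {lower:.0f} to {upper:.0f}")
```

Note that the design effect here multiplies the standard error, not the variance, which is how the published tables were meant to be applied.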


2. ESTIMATION OF STANDARD ERRORS 


A random-groups approach was used to estimate standard 
errors for the census sample data. The United States was 
divided into just over 60,000 distinct areas (called weighting 
areas--areas for which sample weights were derived). For 
each weighting area, sample units (a sample unit being either 
a housing unit or a person residing in a group quarter) were 
assigned systematically among 25 random groups. Thus, it 
was thought that each random group so formed met the 
requirement of having approximately the same sampling 
design as the parent sample (Wolter 1985). 

For each of the 25 random groups, a separate estimate of 
the total for each of 1,804 sample data items was computed by 
multiplying the weighted count for the sample data item 
within the random group by 25. For each data item for which 
the total number of people with a particular characteristic was 
estimated from the sample data, the random-groups standard 
error estimate was then computed from the 25 different 
estimates of the total from the random groups: 

$$S_{RG} = \sqrt{\left(1 - \frac{n}{N}\right) \frac{1}{24} \sum_{i=1}^{25} (y_i - Y)^2},$$

where $n$ represents the unweighted number of persons in the 
sample within the weighting area; $N$ represents the census 
count of persons within the weighting area; $y_i$ represents the 
estimate of the total for the data item achieved by multiplying 
the weighted count for the data item within the $i$-th random 
group by 25; and $Y$ is the weighted count for the data item 
(i.e., the sample estimate) within the weighting area. 
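
A minimal sketch of this computation follows. It generalizes the 25 census random groups to G groups, using a divisor of G - 1 and the factor 1 - n/N. That form is an assumption, chosen because with G = 25 it reproduces the random-groups standard errors reported in Table 2. The function name and the five-group input are hypothetical.

```python
import math


def random_groups_se(group_counts, n, N):
    """Random-groups standard error of an estimated total.

    group_counts: weighted count of the data item in each random group
    n: unweighted number of persons in the sample in the weighting area
    N: census count of persons in the weighting area
    """
    G = len(group_counts)
    Y = sum(group_counts)                 # sample estimate of the total
    y = [G * x for x in group_counts]     # each group's own estimate of the total
    sum_sq = sum((y_i - Y) ** 2 for y_i in y)
    return math.sqrt((1.0 - n / N) * sum_sq / (G - 1))


# Hypothetical weighting area with five random groups.
print(round(random_groups_se([10, 12, 8, 11, 9], n=100, N=600), 4))
```

The estimate is zero only when every random group has the same weighted count, which is the property criticized later in the paper.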


' Willard C. Losinger, U.S. Department of Agriculture: APHIS: VS, CEAH, 555 South Howes Street, Suite 200, Fort Collins, CO 80521, U.S.A. 
Table 1 
Design Effects Published for 1990 U.S. Census 
Sample Data for Vermont 

                                            Percent of persons or housing units in sample 
Characteristic                              < 15%         15-30%        30-45%        > 45% 

[illegible]                                 [illegible]    1.0           0.6           0.5 
Sex                                         [illegible]   [illegible]   [illegible]   [illegible] 
Race                                        [illegible]   [illegible]   [illegible]   [illegible] 
Hispanic origin (of any race)               [illegible]   [illegible]   [illegible]   [illegible] 
Marital status                              [illegible]   [illegible]   [illegible]   [illegible] 
Household type and relationship             [illegible]   [illegible]   [illegible]   [illegible] 
Children ever born                          [illegible]   [illegible]   [illegible]   [illegible] 
Work disability and mobility 
  limitation status                          1.2           1.0           0.6           0.5 
Ancestry                                     1.8          [illegible]    1.0           0.8 
Place of birth                              [illegible]    1.6           1.0           0.9 
Citizenship                                 [illegible]    1.4           1.0           0.8 
Residence in 1985                            1.9           1.7           1.0           0.9 
Year of entry                                1.3           1.0           0.6           0.5 
Language spoken at home and ability 
  to speak English                           1.6           1.3           0.9           0.7 
Educational attainment                      [illegible]    1.1           0.6           0.5 
School enrollment                            1.6           1.4           1.0           0.8 
Type of residence (urban/rural)              1.7          [illegible]    1.4           1.4 
Household type                              [illegible]    1.0           0.6           0.5 
Family type                                  1.1           1.0           0.6           0.5 
Group quarters                               1.0           1.1           0.9           0.8 
Subfamily type and presence of 
  children                                  [illegible]    0.9           0.5           0.5 
Employment status                           [illegible]    1.0           0.6           0.5 
Industry                                     1.2           1.0           0.6           0.5 
Occupation                                   1.2           1.0           0.6           0.5 
Class of worker                             [illegible]    1.0           0.6           0.5 
Hours per week and weeks worked 
  in 1989                                    1.4          [illegible]    0.7           0.6 
Number of workers in family                  1.3           1.1           0.7           0.6 
Place of work                                1.4          [illegible]    0.8           0.6 
Means of transportation to work              1.4          [illegible]    0.7           0.6 
Travel time to work                          1.3           1.1           0.6           0.5 
Private vehicle occupancy                    1.4          [illegible]    0.7           0.6 
Time leaving to go to work                  [illegible]    1.0           0.6           0.5 
Type of income in 1989                       1.3          [illegible]    0.6           0.5 
Household income in 1989                     1.1           1.0           0.6           0.5 
Family income in 1989                       [illegible]    1.0           0.6           0.5 
Poverty status in 1989 (persons)            [illegible]    1.2           0.7           0.7 
Poverty status in 1989 (families)            1.1           0.9           0.5           0.5 
Armed forces and veteran status              1.4          [illegible]    0.7           0.6 

Source: U.S. Bureau of the Census (1993). 1990 Census of 
Population: Social and Economic Characteristics: Vermont. 
Report Number 1990 CP-2-47. Page C-11. 


A standard error based upon simple random sampling and 
a one-in-six sampling rate was computed thus: 

$$S_{SRS} = \sqrt{5\,Y\,(1 - Y/N)},$$

developed from standard formulas displayed in Cochran 
(1977). 
For each data item within the weighting area, a design 
effect was computed as the ratio of $S_{RG}$ to $S_{SRS}$: 

$$F = S_{RG} / S_{SRS}.$$

For a state report of sample data, the design effects for each 
data item were averaged across the weighting areas in the 
state. Then, a generalized design effect for each data item 
type (for example, all data items that dealt with occupation) 
was computed. The generalized design effect was weighted 
in favor of data items that had higher population estimates. 
Details on most of the procedures followed are available in a 
Census Bureau document (U.S. Bureau of the Census 1991). 
The same basic method was also used for sample data 
products in both the 1970 and 1980 census. 


3. A HYPOTHETICAL EXAMPLE OF 
RANDOM GROUPS 


Table 2 presents a hypothetical example of data that might 
have arisen from the random-groups method. For a weighting 
area in Vermont, weighted counts of whites and blacks are 
listed for the 25 random groups. In this hypothetical 
weighting area, there are no persons of other race. The 
standard errors assuming simple random sampling are the 
same for whites and blacks (as one would expect for a 
binomial variable). However, $S_{RG}$ is much higher for the 
estimate of whites than the estimate of blacks. And, the 
design effect is nearly five times higher for the estimate of 
whites than the estimate of blacks. Since the generalized 
design effect computed for groups of data items was weighted 
in favor of data items that had higher population estimates, 
the generalized design effect computed for race for the state 
of Vermont was quite high. 

Data on race were frequently included in 1990 U.S. census 
sample data products. Because race was asked of every 
census respondent (i.e., it was a census 100-percent data 
item), and because the weighting process used by the Census 
Bureau effectively forced the sample estimates by race to 
match the 100-percent Census counts by race, the standard 
errors for estimates of race probably should have been 
considered to be zero. However, generalized design effects 
were still published by race, although set to arbitrary 
constants for all reports (rather than as computed by this 
method). 


4. A MODIFIED APPROACH TO THE RANDOM 
GROUPS METHOD 


A slight modification of the random groups method 
(essentially applying a ratio-estimation technique) can achieve 
much more satisfactory results in the estimation of standard 
errors. Rather than using $y_i$, as defined above for the estimate 
of the total for the $i$-th random group, one could instead use 

$$\tilde{y}_i = N X_i / W_i,$$

Table 2 
Hypothetical example of data that could have resulted 
from the Random Groups method used to estimate 
standard errors for census sample data. 
For a weighting area in Vermont, people are asked their race. 
A few (110) are black; most (2,518) are white. 
A sampling rate of one-in-six is assumed (N = 2,628, n = 438). 

                     Weighted     Weighted     Total weighted 
Random Group         count        count        population 
                     of blacks*   of whites    count # 

 1                     10           90          100 
 2                      0          100          100 
 3                      0          110          110 
 4                      0          140          140 
 5                      5           70           75 
 6                      8           50           58 
 7                     12          103          115 
 8                     20           60           80 
 9                      0           65           65 
10                      0          100          100 
11                      0          125          125 
12                      0          130          130 
13                     10           90          100 
14                      0          100          100 
15                      0          110          110 
16                      0          140          140 
17                      5           70           75 
18                      8           52           60 
19                     12          103          115 
20                     20          160          180 
21                      0           65           65 
22                      0          100          100 
23                      0          125          125 
24                      0          130          130 
25                      0          130          130 
Sum of weighted 
counts (Y)            110        2,518        2,628 
S_RG                  145.98      687.96 
S_SRS                  22.96       22.96 
F                       6.36       29.96 

* The first 25 figures in this column represent X_i for the i-th 
random group under the modified random groups method. 
Multiplying the figure by 25 yields y_i for the random groups 
method employed by the U.S. Bureau of the Census. 

# The first 25 figures in this column represent W_i under the 
modified random groups method. 
where $X_i$ represents the weighted count for the data item 
within the $i$-th random group, $W_i$ is the weighted count of all 
persons in the $i$-th random group, and $N$ represents the census 
count of persons in the weighting area. The modified random 
groups standard error estimate is then 

$$S_M = \sqrt{\left(1 - \frac{n}{N}\right) \frac{1}{24} \sum_{i=1}^{25} \left(\frac{N X_i}{W_i} - Y\right)^2}.$$

Using this method, $S_M$ is 160.78 for both blacks and whites 
in the hypothetical weighting area of Table 2 (close to the 
value of $S_{RG}$ for blacks). In this case, the requirement for 
standard error estimates for both responses for a binomial 
variable to be identical is met. Moreover, if all sample units 
have the same response for some variable, $S_M$ becomes zero, 
whereas $S_{RG}$ only becomes zero when each random group has 
the same weighted count. 
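
The figures quoted for the hypothetical weighting area can be checked directly. The sketch below is an illustration, not the Census Bureau's program. It assumes both estimators share the form sqrt((1 - n/N) / 24 times the sum of squared deviations of the 25 group estimates from Y), and it takes the counts for random groups 10 and 11 to be 0 blacks with 100 and 125 whites, the values implied by the column totals and the reported whites standard error. Function names are hypothetical.

```python
import math

N, n = 2628, 438  # census count and unweighted sample size for the weighting area

# Weighted counts by random group (groups 10 and 11 inferred, see above).
blacks = [10, 0, 0, 0, 5, 8, 12, 20, 0, 0, 0, 0,
          10, 0, 0, 0, 5, 8, 12, 20, 0, 0, 0, 0, 0]
whites = [90, 100, 110, 140, 70, 50, 103, 60, 65, 100, 125, 130,
          90, 100, 110, 140, 70, 52, 103, 160, 65, 100, 125, 130, 130]
totals = [b + w for b, w in zip(blacks, whites)]  # W_i, all persons in group i


def se_from_group_estimates(estimates, Y):
    """Standard error from per-group estimates of a total whose sample estimate is Y."""
    G = len(estimates)
    sum_sq = sum((e - Y) ** 2 for e in estimates)
    return math.sqrt((1.0 - n / N) * sum_sq / (G - 1))


def s_rg(x):
    """Original method: each group estimate is the group count times the number of groups."""
    return se_from_group_estimates([len(x) * x_i for x_i in x], sum(x))


def s_m(x, w):
    """Modified method: each group estimate is the ratio estimate N * X_i / W_i."""
    return se_from_group_estimates([N * x_i / w_i for x_i, w_i in zip(x, w)], sum(x))


for label, counts in (("blacks", blacks), ("whites", whites)):
    print(f"{label}: S_RG = {s_rg(counts):.2f}, S_M = {s_m(counts, totals):.2f}")
```

Passing `totals` as the data item, which corresponds to every person giving the same response, yields a modified standard error of zero while the original estimate stays positive because the group sizes differ.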

This modified standard error estimation procedure could 
be useful for researchers who do not have access to any of the 
many computer programs now available for computing 
estimates from sample data (such as SUDAAN, STATA, 
PC-CARP, VPLX, etc.). In addition, the U.S. Bureau of the 
Census ought to consider modifying its approach for 
estimating standard errors for sample data from the 2000 
census. Moreover, with the U.S. Bureau of the Census’ 
current emphasis on quality management, the U.S. Bureau of 
the Census may wish to poll users of sample data products to 
determine how useful the presentation of standard errors 
(through design effects) was to them, and involve a number 
of the data users in improving the presentation of standard 
errors for the next census. 
A Simple Derivation of the Linearization of the 
Regression Estimator 


KEES ZEELENBERG' 


ABSTRACT 


We show how the use of matrix calculus can simplify the derivation of the linearization of the regression coefficient 


estimator and the regression estimator. 


KEY WORDS: Matrix calculus; Regression estimator; Taylor expansion. 


1. INTRODUCTION 


Design-based sampling variances of non-linear statistics 
are often calculated by means of a linear approximation 
obtained by a Taylor expansion; examples are the variances 
of the general regression coefficient estimator and the re- 
gression estimator. The linearizations usually need some 
complicated differentiations. The purpose of this paper is to 
show how matrix calculus can simplify these derivations, to 
the extent that even the Taylor expansion of the regression 
coefficient estimator can be derived in one line, which should 
be compared with the nearly one page that Särndal et al. 
(1992, p. 205-206) need. To be honest, the use of matrix 
calculus requires some more machinery to be set up, which is 
not needed for traditional methods. However, this set-up can 
be regarded as an investment; once it has been learned, it can 
be used fruitfully in many other applications. After this paper 
had been written, Binder (1996) appeared, in which similar 
techniques are used to derive variances by means of 
linearization. The present paper can be seen as a pedagogical 
note, in which the use of differentials is explained. 


2. MATRIX DIFFERENTIALS 


2.1. Introduction 


We will use the matrix calculus by means of differentials, 
as set out by Magnus and Neudecker (1988); this calculus 
differs somewhat from the usual methods, which focus on 
derivatives instead of differentials. Therefore in this section 
we will briefly describe the definitions and properties of 
differentials (see Zeelenberg 1993, for a more extensive 
survey). We first define differentials for vector functions, 
and then generalize to matrix functions. 


2.2 Vector Functions 


Let $f$ be a function from an open set $S \subset \mathbb{R}^m$ to $\mathbb{R}^n$; let $x_0$ 
be a point in $S$. The function $f$ is differentiable at $x_0$ if there 
exists a real $n \times m$ matrix $A_{x_0}$, depending on $x_0$, such that for 
any $u \in \mathbb{R}^m$ for which $x_0 + u \in S$, there holds 

$$f(x_0 + u) = f(x_0) + A_{x_0} u + o(u), \qquad (1)$$

where $o(u)$ is a function such that $\lim_{|u| \to 0} |o(u)|/|u| = 0$; the 
matrix $A_{x_0}$ is called the first derivative of $f$ at $x_0$; it is denoted as 
$Df(x_0)$ or $\partial f/\partial x'|_{x = x_0}$. The derivative $Df$ is equal to the 
matrix of partial derivatives, i.e., $Df(x)_{ij} = \partial f_i/\partial x_j$. The linear 
function $df_{x_0}: \mathbb{R}^m \to \mathbb{R}^n$ defined by $df_{x_0}(u) = A_{x_0} u$ is called the 
differential of $f$ at $x_0$. Usually we write $dx$ instead of $u$ so that 
$df_{x_0}(dx) = A_{x_0}\,dx$. From (1) we see that the differential 
corresponds to the linear part of the function, which can also be 
written as 

$$y - y_0 = A_{x_0}(x - x_0),$$

where $y_0 = f(x_0)$. Therefore the differential of a function is the 
linearization of the function: it is the equation of the 
hyperplane through the origin that is parallel to the hyperplane 
tangent to the graph of $f$ at $x_0$; so the linearized function can 
be written as 

$$f(x) = f(x_0) + A_{x_0}(x - x_0). \qquad (2)$$

Alternatively, if $B$ is a matrix such that $df_{x_0}(dx) = B\,dx$, then 
$B$ is the derivative of $f$ at $x_0$ and contains the partial derivatives 
of $f$ at $x_0$. This one-to-one relationship between differentials 
and derivatives is very useful, since differentials are easy to 
manipulate. 

Finally, we usually omit the subscript 0 in $x_0$, so that we 
write $df = A\,dx$. 


2.3 Matrix Functions 


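A matrix function $F$ from an open set $S \subset \mathbb{R}^{m \times n}$ to $\mathbb{R}^{p \times q}$ is 
differentiable if $\operatorname{vec} F$ is differentiable. The derivative $DF$ is 
the derivative of $\operatorname{vec} F$ with respect to $\operatorname{vec} X$, and is also 
denoted by $\partial \operatorname{vec} F/\partial(\operatorname{vec} X)'$. The differential $dF$ is the matrix 
function defined by $\operatorname{vec} dF_{X_0}(U) = A_{X_0} \operatorname{vec} U$. 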


' Kees Zeelenberg, Department of Statistical Methods, Statistics Netherlands, P.O. Box 4000, 2270 JM Voorburg, The Netherlands. 
2.4 Properties of Differentials 


Let A be a matrix of constants, F and G differentiable 
matrix functions, and a a real scalar. Then the following 
properties are easily proved: 


$$dA = 0, \qquad (3)$$

$$d(aF) = a\,dF, \qquad (4)$$

$$d(F + G) = dF + dG, \qquad (5)$$

$$d(FG) = (dF)G + F(dG), \qquad (6)$$

$$dF^{-1} = -F^{-1}(dF)F^{-1}. \qquad (7)$$


The last property can be proved by taking the differential 
of $FF^{-1} = I$ and rearranging. 
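
Written out, the rearrangement takes two lines. This is simply a worked version of the argument just stated, using the differential of a constant matrix (property 3, applied to the identity matrix $I$) and the product rule (property 6):

```latex
\begin{align*}
0 = dI = d(FF^{-1}) &= (dF)F^{-1} + F\,d(F^{-1}) \\
\Longrightarrow\quad d(F^{-1}) &= -F^{-1}(dF)F^{-1}.
\end{align*}
```

The second line follows by moving $(dF)F^{-1}$ to the other side and premultiplying by $F^{-1}$.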


3. LINEARIZATION OF THE REGRESSION 
COEFFICIENT ESTIMATOR 


The π-estimator (Horvitz-Thompson estimator) of the 
finite population regression coefficient (cf. Särndal et al. 
1992, section 5.10) is 

$$\hat{B} = \hat{T}^{-1}\hat{t}, \qquad (8)$$

where 

$$\hat{T} = \sum_{k \in s} \frac{x_k x_k'}{\pi_k}, \qquad \hat{t} = \sum_{k \in s} \frac{x_k y_k}{\pi_k},$$

$y_k$ is the variable of interest for individual $k$, $x_k$ is the vector 
with the auxiliary variables for individual $k$, $\pi_k$ is the inclusion 
probability for individual $k$, and $s$ denotes the sample. 
Taking the total differential of (8), using properties (6) and 
(7), and evaluating at the point where $\hat{T} = T$ and $\hat{t} = t$, we get 

$$d\hat{B} = -T^{-1}(d\hat{T})T^{-1}t + T^{-1}(d\hat{t}). \qquad (9)$$

Because of the connection between differentials and linear 
approximation, as given in equation (2), it immediately 
follows that (9) corresponds to the linearization of the 
regression coefficient estimator: 

$$\hat{B} \approx B - T^{-1}(\hat{T} - T)T^{-1}t + T^{-1}(\hat{t} - t) = B + T^{-1}(\hat{t} - \hat{T}B),$$

where $B = T^{-1}t$. 
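
A quick numerical check makes the quality of this approximation concrete. The sketch below restricts attention to a single auxiliary variable, so that the matrices reduce to scalars. That restriction, the function names and the numbers are all assumptions made for brevity. In the scalar case the exact estimator is the ratio of the two estimated totals, and the error of the linearization should shrink with the square of the perturbation.

```python
def beta_hat(T_hat, t_hat):
    """Exact regression coefficient estimator in the scalar case."""
    return t_hat / T_hat


def beta_linearized(T_hat, t_hat, T, t):
    """First-order approximation B + (t_hat - T_hat * B) / T with B = t / T."""
    B = t / T
    return B + (t_hat - T_hat * B) / T


def linearization_error(eps, T=10.0, t=20.0):
    """Exact minus linearized value when T_hat = T + eps and t_hat = t + 5 * eps."""
    T_hat, t_hat = T + eps, t + 5.0 * eps
    return beta_hat(T_hat, t_hat) - beta_linearized(T_hat, t_hat, T, t)


for eps in (0.1, 0.01):
    print(f"perturbation {eps}: error {linearization_error(eps):.3e}")
```

Reducing the perturbation by a factor of ten reduces the error by roughly a factor of one hundred, as expected of a first-order Taylor approximation.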


4. LINEARIZATION OF THE REGRESSION 
ESTIMATOR 


The regression estimator of a population total is (cf. Särndal 
et al. 1992, section 6.6) 

$$\hat{t}_{yr} = \hat{t}_{y\pi} + (t_x - \hat{t}_{x\pi})'\hat{B}, \qquad (10)$$

where $\hat{t}_{y\pi}$ is the π-estimator of the variable of interest, $t_x$ is the 
vector with the population totals of the auxiliary variables, $\hat{t}_{x\pi}$ 
is the vector with the π-estimators of the auxiliary variables, 
and $\hat{B}$ is the estimator of the regression coefficient of the 
auxiliary variables on the variable of interest. Taking the total 
differential of (10), using properties (3) and (6), and evaluating 
at the point where $\hat{t}_{y\pi} = t_y$, $\hat{t}_{x\pi} = t_x$, and $\hat{B} = B$, we get the linear 
approximation of the regression estimator 

$$d\hat{t}_{yr} = d\hat{t}_{y\pi} - (d\hat{t}_{x\pi})'B,$$

so that 

$$\hat{t}_{yr} \approx t_y + (\hat{t}_{y\pi} - t_y) - (\hat{t}_{x\pi} - t_x)'B = \hat{t}_{y\pi} - (\hat{t}_{x\pi} - t_x)'B.$$

Note that for the linearization of the regression estimator we do 
not need that of the regression coefficient estimator $\hat{B}$. 
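
The reason the linearization of the regression coefficient estimator is not needed becomes visible when the differential is written out before evaluation. In the worked step below, $\hat{t}_{yr}$ denotes the regression estimator, $\hat{t}_{y\pi}$ and $\hat{t}_{x\pi}$ the π-estimators of the totals of the variable of interest and of the auxiliary variables, $t_x$ the known auxiliary totals, and $\hat{B}$ the estimated regression coefficient:

```latex
\begin{align*}
d\hat{t}_{yr}
  &= d\hat{t}_{y\pi} + \bigl(d(t_x - \hat{t}_{x\pi})\bigr)'\hat{B}
     + (t_x - \hat{t}_{x\pi})'\,d\hat{B} \\
  &= d\hat{t}_{y\pi} - (d\hat{t}_{x\pi})'\hat{B}
     + (t_x - \hat{t}_{x\pi})'\,d\hat{B}.
\end{align*}
```

At the evaluation point $\hat{t}_{x\pi} = t_x$ the factor multiplying $d\hat{B}$ is zero, so the last term drops out and only $d\hat{t}_{y\pi} - (d\hat{t}_{x\pi})'B$ remains.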
Layout 


Manuscripts should be typed on white bond paper of standard size (8½ x 11 inch), one side only, entirely double spaced 
with margins of at least 1½ inches on all sides. 

The manuscripts should be divided into numbered sections with suitable verbal titles. 

The name and address of each author should be given as a footnote on the first page of the manuscript. 
Acknowledgements should appear at the end of the text. 
Abstract 


The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. Avoid 
mathematical expressions in the abstract. 


Style 


Avoid footnotes, abbreviations, and acronyms. 

Mathematical symbols will be italicized unless specified otherwise except for functional symbols such as “exp(-)” and 
Short formulae should be left in the text but everything in the text should fit in single spacing. Long and important equations 
to later. 

Write fractions in the text using a solidus. 

Distinguish between ambiguous characters, (e.g., w, ω; o, O, 0; l, 1).


Figures and Tables


All figures and tables should be numbered consecutively with arabic numerals, with titles which are as nearly self 
explanatory as possible, at the bottom for figures and at the top for tables. 

They should be put on separate pages with an indication of their appropriate placement in the text. (Normally they should 
after the reference, e.g., Cochran (1977, p. 164).


In This Issue


This issue of Survey Methodology contains articles on a variety of topics. Kott and Stukel 
consider jackknife variance estimation for a specific, but widely used two-phase design. At the first 
phase, clusters within strata are selected using SRS with replacement, and all units within the 
selected clusters are sampled. At the second phase, the sampled units are restratified and then second 
phase units are selected using SRS without replacement. Two point estimators are considered: the 
“reweighted expansion estimator” and the more commonly known “double expansion estimator”. 
Under this design, it is shown that the jackknife variance estimator behaves remarkably better for 
the former point estimator than it does for the latter. A Monte Carlo study supports these findings. 

Decaudin and Labat describe a “multi-source” population estimation system designed to produce 
local population estimates during intercensal periods in France. The system is robust and flexible 
in that it works with a variable number of sources. It is based on a robust combination of estimates 
from different sources, blending demographic reasoning with statistical methods. 

Ravalet applies GM-estimators to INSEE’s industrial investment survey with an adaptive 
procedure to produce a robust estimator. Tukey’s biweight function and the Cauchy function are 
examined. Each function relies on a tuning constant based on the width of the tail of the distribution 
and the concentration of the residuals. Tuning constants that minimize the estimator’s variance are 
determined for eight distributions representing various scenarios relating to the width of the tail and 
the concentration of the residuals, which are assumed to be symmetrical. 

Cotton and Hesse study the characteristics of various methods of selecting a stratified panel of 
fixed size, along with their impact on initial selection, rotation, resampling and sample overlap. The 
authors propose a kind of algorithm based on transformations of permanent random numbers used 
for sampling purposes; the algorithm extends the pre-resampling rotation into the post-resampling 
period. The transformations can be performed on random numbers that have been made equidistant 
and on random numbers derived from a uniform distribution. 

In his paper Farrell studies empirical Bayes estimation of small area proportions. Using data from 
the United States Census he compares empirical Bayes small area estimates of proportions of 
individuals in different income categories based on multinomial and ordinal logistic models with 
random effects. Inferences based on the ordinal model were slightly better than those based on the 
multinomial model. He also compares naive and bootstrap adjusted variance estimates and coverage 
probabilities of their associated confidence intervals. The bootstrap adjustment improves coverage 
significantly. 

Gelman and Little describe a novel extension of analyzing poststratified survey data, using 
Bayesian hierarchical logistic regression modelling. The technique allows for many more 
stratification categories than are typically feasible using standard poststratification and weighting 
strategies, and thus much more population level information can be included in the model. The 
proposed method as well as some of the more standard methods are applied to pre-election opinion 
polling data in the U.S., and the various models are evaluated graphically by comparing them to 
actual election outcomes. 

Singh, Tsui, Suchindran and Narayana describe the survey design and estimation techniques used 
for PERFORM (Project Evaluation Review for Organizational Resource Management), a large scale 
survey conducted in the state of Uttar Pradesh in India. The survey was designed to estimate the 
characteristics of health facilities and their target populations, in order to provide benchmark 
indicators for a large family planning project. PERFORM uses a stratified multi-stage design, where 
the ultimate sampling units are households and eligible females residing within. However, estimates 
of health facilities, which are not explicitly part of the sampling scheme, are also obtained by 
adjusting for multiplicity of the selected secondary sampling units served by those health facilities. 

Dufour, Kaushal and Michaud review the tests and studies that preceded the implementation of
computer-assisted interviewing for most household surveys at Statistics Canada. The interviewing
is conducted in person at the respondent’s home or by telephone from the interviewer’s home, using
laptop computers. They also discuss the challenges that were faced with the implementation of the 
new technology into ongoing surveys and the new opportunities for monitoring survey collection 
offered by it. 

Scheuren and Winkler propose a method for using noncommon but correlated quantitative 
variables to improve record linkage. The basic idea is to use the linkages which are almost certainly 
correct to estimate a regression relationship between the noncommon variables and then to use the 
predicted values of these variables in a subsequent record linkage step. The procedure can then be 
iterated until convergence. The regression step uses a procedure which adjusts the regression for 
possible errors in the linkage, described in an article by the same authors in the June 1993 issue of 
Survey Methodology. The method is illustrated empirically and it is shown that it can lead to good 
results in situations that were hitherto hopeless. 


The Editor 


Dear Survey Methodology Reader, 


I would like to take a moment to thank you for your interest and support of Survey Methodology. 
Since its inception, the journal has remained committed to publishing articles relevant to statistical
agencies and researchers with emphasis on the development and evaluation of specific 
methodologies as applied to data collection or to the data themselves. 

Survey Methodology is approaching its 25th anniversary. From its beginning as an in-house 
review of developments in survey methodology in Statistics Canada, it has evolved into a widely 
read statistical journal with an editorial board of internationally recognized survey statisticians. 
Though many improvements to content and presentation have occurred during this period, there is 
always room for improvement. I would appreciate any suggestions, comments and recommendations 
you may have to assist us in our task of maintaining Survey Methodology as a viable platform for 
statistical development into the next millennium. 

Should you wish to have complimentary copies of Survey Methodology sent to a colleague, please 
do not hesitate to contact us. 

I thank you again for your interest and continued support of Survey Methodology. 


Sincerely, 


M.P. Singh 
singhmp@statcan.ca


Survey Methodology, December 1997 
Vol. 23, No. 2, pp. 81-89 
Statistics Canada 




Can the Jackknife Be Used With a Two-Phase Sample? 


PHILLIP S. KOTT and DIANA M. STUKEL¹


ABSTRACT 


The jackknife variance estimator has been shown to have desirable properties when used with smooth estimators based on 
stratified multi-stage samples. This paper focuses on the use of the jackknife given a particular two-phase sampling design: 
a stratified with-replacement probability cluster sample is drawn, elements from sampled clusters are then restratified, and 
simple random subsamples are selected within each second-phase stratum. It turns out that the jackknife can behave 
reasonably well as an estimator for the variance for one common “expansion” estimator but not for another. Extensions 
to more complex estimation strategies are then discussed. A Monte Carlo study supports our principal findings. 


KEY WORDS: Stratified; Reweighted expansion estimator; Double expansion estimator; Asymptotic. 


1. INTRODUCTION 


Krewski and Rao (1981) and Rao and Wu (1985) 
explore the design-based properties of the jackknife 
variance estimator given a stratified multi-stage sample 
incorporating with-replacement sampling in the first stage. 
Their results, although fairly general, cannot be directly 
applied to many multi-phase sampling designs. See also 
Wolter (1985; Chapter 4.5). 

In this paper, we consider a simple example of two-phase 
sampling. A stratified with-replacement probability cluster 
sample is selected in a first phase of sampling. The 
elements in sampled clusters are then restratified, perhaps 
using information gathered from the first-phase sample, and 
a stratified simple random subsample is drawn without 
replacement. 

One can estimate a total without auxiliary information in
one of two ways. In the double expansion estimator (called
“the $\pi^*$ estimator” in Särndal, Swensson, and Wretman
1992, p. 347), the value of each subsampled element is
simply multiplied by the product of its expansion factors at
each phase (i.e., the inverses of its first-phase and
second-phase selection probabilities) and then summed.

Although the double expansion estimator is more easily 
located in text books, the reweighted expansion estimator 
may be more common in practice, especially when element 
nonresponse is treated as a second phase of sampling, as in 
the weighting class estimator of Oh and Scheuren (1983, 
p. 150). An estimator for the population size of each 
second-phase stratum is computed by summing the first- 
phase expansion factors of all the elements in the second- 
phase stratum before subsampling. This value is then 
multiplied by the estimated second-phase stratum mean 
based on the subsample to yield an estimated stratum total. 
The second-phase estimated stratum totals are finally added 
together to produce the reweighted expansion estimator for 
the population total. 

We are more concerned here with real two-phase
sampling, rather than the artifice of treating nonresponse as
an additional sampling phase. The National Agricultural
Statistics Service (NASS) presently uses the double
expansion estimator in its Quarterly Agricultural Surveys
(QAS). A stratified area cluster sample is enumerated in
June. Farms identified in the June survey are restratified
based on their June responses and then subsampled for
enumeration in September, December, and March.

NASS uses a two-phase design and the reweighted 
expansion estimator for its on-farm chemical use surveys. 
The first phase of sampling identifies farms with specific 
crops, and the second phase measures pesticide use on 
those crops. 

This paper shows that although the jackknife may be 
used to estimate the variance of the reweighted expansion 
estimator under certain conditions, it is not generally 
effective as a variance estimator for the double expansion 
estimator. Section 2 introduces the reweighted expansion 
estimator and discusses its mean squared error. Section 3
shows that the jackknife variance estimator can be nearly
unbiased for the variance of the reweighted expansion estimator, while
Section 4 addresses the jackknife’s failings as a variance
estimator for the double expansion estimator. Section 5
describes a simulation study that appears to confirm the 
main assertions of the previous sections. Section 6 
discusses extensions of the reweighted expansion estimator, 
and Section 7 offers some concluding remarks. An 
appendix provides an outline of our assumed asymptotic 
framework and some proofs. 


2. THE REWEIGHTED EXPANSION 
ESTIMATOR 


2.1 The Estimator 


Let $h$ $(= 1, ..., H)$ denote the first-phase strata of a
stratified with-replacement probability cluster sample, $n_h$
the number of sampled clusters in stratum $h$, and $F_h$ the set
of those clusters. Let $g$ $(= 1, ..., G)$ be the second-phase
strata from which a stratified simple random subsample is
drawn without replacement. An element in a cluster
sampled $p$ times in the first phase is treated as $p$ distinct
elements for the subsample. Let $M_g$ be the number of
elements in $g$ before subsampling and $m_g$ the number of
subsampled elements in $g$. In practice, the $G$ second-phase
strata are often not defined until after the first-phase sample
has been drawn.

¹ Phillip S. Kott, National Agricultural Statistics Service, 3251 Old Lee Highway, Room 305, Fairfax, VA 22030; Diana M. Stukel, Household Survey Methods

Let $S_g$ be the set of elements in $g$ before subsampling, $s_g$
the set of subsampled elements in $g$, $s$ the entire set of
subsampled elements, and $m = \sum_g m_g$ the subsample size.
Finally, let $y_i$ be the value of interest for element $i$, and $w_i$
the first-phase expansion factor for $i$ (i.e., the inverse of the
selection probability for the cluster containing $i$).

The estimator for the population total, $T$, one would use
if all the elements in the first-phase sample were
enumerated can be written as

$$t_1 = \sum_{g=1}^{G} \sum_{i \in S_g} w_i y_i. \qquad (1)$$


Let the reweighted expansion estimator for $T$ be:

$$t_2 = \sum_{g=1}^{G} \Big( \sum_{i \in S_g} w_i \Big) \frac{\sum_{i \in s_g} (M_g/m_g) w_i y_i}{\sum_{i \in s_g} (M_g/m_g) w_i}. \qquad (2)$$


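An alternative expression for $t_2$ is

$$t_2 = \sum_{g=1}^{G} \sum_{i \in s_g} a_i y_i, \qquad (3)$$

where, for $i \in s_g$,

$$a_i = w_i \, \frac{\sum_{k \in S_g} w_k}{\sum_{k \in s_g} w_k}$$

is the adjusted weight for element $i$. Equation (3) is what
gives the reweighted expansion estimator its name.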
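
The mechanics of equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the `in_sub` membership flags and the toy data are all hypothetical. Because $M_g/m_g$ is constant within a second-phase stratum, it cancels from the ratio in equation (2), so the sketch omits it. The last function computes adjusted weights (first-phase weight times the ratio of the two weight sums in the stratum), which reproduce the same total.

```python
from collections import defaultdict

def full_first_phase_total(w, y):
    """t_1: sum of w_i * y_i over every first-phase element."""
    return sum(wi * yi for wi, yi in zip(w, y))

def _stratum_sums(w, y, g, in_sub):
    """Per second-phase stratum: sum of w over S_g, sum of w over s_g,
    and sum of w*y over s_g."""
    size = defaultdict(float)
    w_sub = defaultdict(float)
    wy_sub = defaultdict(float)
    for wi, yi, gi, ci in zip(w, y, g, in_sub):
        size[gi] += wi
        if ci:
            w_sub[gi] += wi
            wy_sub[gi] += wi * yi
    return size, w_sub, wy_sub

def reweighted_expansion_total(w, y, g, in_sub):
    """t_2: estimated stratum size times the weighted subsample mean,
    summed over strata. Assumes every stratum has a subsampled element."""
    size, w_sub, wy_sub = _stratum_sums(w, y, g, in_sub)
    return sum(size[k] * wy_sub[k] / w_sub[k] for k in size)

def adjusted_weights(w, y, g, in_sub):
    """Adjusted weight for each element (0 for elements not subsampled)."""
    size, w_sub, _ = _stratum_sums(w, y, g, in_sub)
    return [wi * size[gi] / w_sub[gi] if ci else 0.0
            for wi, gi, ci in zip(w, g, in_sub)]

# Hypothetical toy data: six first-phase elements in two second-phase strata.
w = [2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
y = [1, 0, 1, 1, 0, 1]   # in practice y is only observed where in_sub is True
g = [0, 0, 0, 1, 1, 1]
in_sub = [True, False, True, True, True, False]

t1 = full_first_phase_total(w, y)
t2 = reweighted_expansion_total(w, y, g, in_sub)
t2_alt = sum(a * yi for a, yi in zip(adjusted_weights(w, y, g, in_sub), y))
```

In this toy example `t2` and `t2_alt` agree, which is the sense in which the estimator "reweights" the subsampled elements.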


2.2 Its Mean Squared Error (Some Theory) 


Now $t_2$ is not, in general, an unbiased estimator of $T$.
Nevertheless, under certain mild conditions specified in the
appendix, it is a design consistent estimator for $T$; that is,
$\operatorname{plim}\,(t_2 - T)/T = 0$ (Isaki and Fuller 1982). For the
exposition in the text, it suffices to say that the $m_g$ are
assumed to be large.

Observe that

$$E[(t_2 - T)^2] = E[(\{t_1 - T\} + \{t_2 - t_1\})^2] \approx \mathrm{Var}_1(t_1) + E_1\{E_2[(t_2 - t_1)^2]\},$$

where the subscripts on Var and $E$ denote the phase of
sampling. Since the $m_g$ are assumed to be large,
$E_2(t_2 - t_1) \approx 0$. Also,
$E_1[E_2(t_2 - T)] \approx 0$, and the mean squared error of $t_2$ is
effectively its (asymptotic) variance.

Since the first phase of sampling was conducted with
replacement, $\mathrm{Var}_1(t_1)$ can, in principle, be estimated by

$$v_{1L} = \sum_{h=1}^{H} \frac{n_h}{n_h - 1} \sum_{j \in F_h} \Big( \sum_{i \in U_{hj}} w_i y_i - \frac{1}{n_h} \sum_{j' \in F_h} \sum_{i \in U_{hj'}} w_i y_i \Big)^2, \qquad (4)$$

where $U_{hj}$ is the set of the elements in sampled cluster $j$ of
first-phase stratum $h$. The subscript $L$ denotes “linearization”
for historical reasons although there is nothing to
linearize in this context. Note that when there is a second
phase of sampling, it will generally not be possible to
compute $v_{1L}$ in practice.
Now 


JEF;, |t€U,, JEF;, 1€U;, 


where 
It is crucial for the arguments below to realize that $r_i$ has
been defined so that $\sum_{i \in S_g} w_i r_i = 0$ for all $g$.

Continuing,

since $\sum_{i \in S_g} w_i \approx \sum_{i \in s_g} (M_g/m_g) w_i$ (see equation (A1) of the
appendix). This implies

Observe that equation (6) does not ignore the finite
population corrections from the second phase of sampling. 


3. THE JACKKNIFE VARIANCE ESTIMATOR 


3.1 The Variance Estimator 


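We are now ready to discuss the jackknife. For $j \in F_h$,
define the jackknife replicate $t_{(hj)2}$ as

$$t_{(hj)2} = \sum_{g=1}^{G} \Big( \sum_{i \in S_g} w_{i(hj)} \Big) \frac{\sum_{i \in s_g} w_{i(hj)} y_i}{\sum_{i \in s_g} w_{i(hj)}}, \qquad (7)$$

where

$$w_{i(hj)} = \begin{cases} w_i n_h/(n_h - 1) & \text{when } i \in U_{hj'} \text{ and } j' \neq j, \\ 0 & \text{when } i \in U_{hj}, \\ w_i & \text{when } i \in U_{h'j'} \text{ and } h' \neq h. \end{cases}$$

Similarly, we define

$$t_{(hj)1} = \sum_{g=1}^{G} \sum_{i \in S_g} w_{i(hj)} y_i.$$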

Following Rust (1985), the jackknife variance estimator,
$v_{Jf}$ ($f = 1$ or $2$), is defined here simply as

$$v_{Jf} = \sum_{h=1}^{H} \frac{n_h - 1}{n_h} \sum_{j \in F_h} (t_{(hj)f} - t_f)^2. \qquad (8)$$

This form appears in Krewski and Rao (1981,
equation (2.4)). It is easy to show that $v_{J1} = v_{1L}$.
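
The delete-one-cluster jackknife just described can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the `draw` labels (one label per first-phase draw within a stratum, so a cluster selected twice carries two labels) and the toy data are hypothetical. The estimator is passed in as a function of the weight vector, so the same routine serves the full first-phase estimator and the reweighted expansion estimator. At least two draws per first-phase stratum are assumed.

```python
from collections import defaultdict

def replicate_weights(w, stratum, draw, h, j):
    """Weights after deleting draw j of stratum h: zero for the deleted
    cluster, n_h/(n_h - 1) times w for the rest of stratum h, unchanged elsewhere."""
    n_h = len({d for s, d in zip(stratum, draw) if s == h})
    out = []
    for wi, s, d in zip(w, stratum, draw):
        if s != h:
            out.append(wi)
        elif d == j:
            out.append(0.0)
        else:
            out.append(wi * n_h / (n_h - 1))
    return out

def jackknife_variance(estimator, w, stratum, draw):
    """Sum over strata of (n_h - 1)/n_h times the squared deviations of the
    replicate estimates from the full-sample estimate."""
    t_full = estimator(w)
    draws = defaultdict(set)
    for s, d in zip(stratum, draw):
        draws[s].add(d)
    v = 0.0
    for h, ids in draws.items():
        n_h = len(ids)
        for j in ids:
            t_rep = estimator(replicate_weights(w, stratum, draw, h, j))
            v += (n_h - 1) / n_h * (t_rep - t_full) ** 2
    return v

def ree(w, y, g, in_sub):
    """Reweighted expansion estimator as a function of the weights.
    Assumes each stratum keeps a positively weighted subsampled element."""
    size, w_sub, wy_sub = defaultdict(float), defaultdict(float), defaultdict(float)
    for wi, yi, gi, ci in zip(w, y, g, in_sub):
        size[gi] += wi
        if ci:
            w_sub[gi] += wi
            wy_sub[gi] += wi * yi
    return sum(size[k] * wy_sub[k] / w_sub[k] for k in size if size[k] > 0)

# Hypothetical toy data: two first-phase strata with two draws each.
stratum = [0, 0, 0, 0, 1, 1, 1, 1]
draw = [0, 0, 1, 1, 0, 0, 1, 1]
w = [2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0]
y = [1, 0, 1, 1, 0, 1, 1, 1]
g = [0, 1, 0, 1, 0, 1, 0, 1]
in_sub = [True, True, True, False, True, True, False, True]

v_j1 = jackknife_variance(lambda ww: sum(a * b for a, b in zip(ww, y)),
                          w, stratum, draw)
v_j2 = jackknife_variance(lambda ww: ree(ww, y, g, in_sub), w, stratum, draw)
```

For the linear full first-phase estimator, the jackknife value on this toy data coincides with the usual with-replacement variance formula computed from the cluster totals (2 and 4 in the first stratum, 3 and 6 in the second).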
3.2 Why it Works (More Theory) 


We will soon see that $v_{J2}$ provides a nearly unbiased
estimator for the variance of the reweighted expansion
estimator in equation (2). Rao and Shao (1992) indirectly
make the same claim (our equation (2) is the expectation of
their estimator in Section 3.3, pp. 818-819). Their work,
however, treats nonresponse as an additional phase of
sample selection in which Poisson sampling (Särndal et al.
1992, p. 85) is used in place of stratified simple random
sampling. Each first-phase sample element in the Rao
and Shao (1992) setup is effectively a second-phase
stratum. Consequently, the near unbiasedness of $v_{J2}$
reduces to a special case of a result in Krewski and Rao
(Rao and Shao 1992, p. 821).

What we have called the second-phase strata are 
reweighting classes in the Rao and Shao (1992) setup. 
Elements in the same class are assumed to have the same 
unknown probability of selection/response. Conditional on
the realized subsample sizes within reweighting classes,
Poisson sampling is equivalent to stratified simple random 
sampling. Rao and Shao’s (1992) treatment, however, is 
unconditional. 

Returning to the problem at hand, observe that 


be; WiyiDi i 


IES, TE 
oe & 
where

Under mild conditions (see equations (A2) and (A3) in
the appendix), we have the following analogue to equation 
(5): 
where $c_i$ is an indicator variable equal to 1 when $i$ is in the
subsample and zero otherwise. 
Continuing, 
where $z_{i(hj)} = y_i + \{[M_g/m_g] c_i - 1\} r_{i(hj)}$. Again, since every
$m_g$ is large, it is not unreasonable to assume $r_{i(hj)} \approx r_i$ (see
equation (A4) in the appendix). Thus,


G 
where $z_i = y_i + \{[M_g/m_g] c_i - 1\} r_i$. Using similar arguments,
$t_2 \approx \sum_{g=1}^{G} \sum_{i \in S_g} w_i z_i$. Since $t_2$ is linear in the $z_i$,
Let $e_i = M_g/m_g$ be the second-phase expansion factor
for $i \in S_g$. Observe that $c_i$ is a random variable with $E(c_i) = m_g/M_g$
and $E(c_i c_k) = (m_g/M_g)(m_g - 1)/(M_g - 1)$ for
$i, k \in S_g$, $i \neq k$.
Now 
| »s wa i ( > wo)" > ye (e,- 1) (wr, 
Similarly, letting $F'_h$ be the set of elements from selected
clusters in the first-phase stratum h before subsampling, we 
In the appendix, it is argued that, under mild conditions,
the last term in both equations (12) and (13) is negligible.
As a result, 
which in turn implies that $v_{J2}$ is a nearly unbiased estimator
for $E[(t_2 - T)^2]$.


4. THE DOUBLE EXPANSION ESTIMATOR 


An alternative to $t_2$, the double expansion estimator, has
the form:

$$t_3 = \sum_{g=1}^{G} \sum_{i \in s_g} (M_g/m_g) w_i y_i. \qquad (15)$$

The definition of a jackknife replicate for $t_3$ is unclear. One
simple possibility is
Another, perhaps more in the spirit of “replication”, is

$$t_{(hj)3} = \sum_{g=1}^{G} \sum_{i \in s_g} (M_{g(hj)}/m_{g(hj)}) w_{i(hj)} y_i, \qquad (17)$$

where $M_{g(hj)}$ is the number of elements in the first-phase
sample (i.e., in a cluster in the first-phase sample) that are
in $S_g$ but not $U_{hj}$. Similarly, $m_{g(hj)}$ is the number of
elements in the second-phase sample that are in $s_g$ but not
$U_{hj}$. Through counter-examples given in the appendix, we
show that neither version of the replicate produces a
jackknife variance estimator ($v_{J3}$ from equation (8)) that is
asymptotically unbiased in general.
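
As a rough illustration of the double expansion estimator and of the second kind of replicate, the sketch below recounts the stratum sizes after a cluster is deleted, following the verbal definitions of the recounted first-phase and second-phase counts. It is an assumption-laden sketch rather than the authors' code: the function names, the per-draw `draw` labels and the toy data are hypothetical, and the replicate weights are assumed to be the same delete-one-cluster weights used for the reweighted expansion estimator.

```python
from collections import Counter

def double_expansion_total(w, y, g, in_sub):
    """t_3: each subsampled value times its first-phase weight and M_g/m_g."""
    M = Counter(g)                                           # elements per stratum before subsampling
    m = Counter(gi for gi, ci in zip(g, in_sub) if ci)       # subsampled elements per stratum
    return sum(M[gi] / m[gi] * wi * yi
               for wi, yi, gi, ci in zip(w, y, g, in_sub) if ci)

def dee_replicate_recounted(w, y, g, in_sub, stratum, draw, h, j):
    """Replicate that drops draw j of first-phase stratum h, rescales the
    remaining first-phase weights in stratum h by n_h/(n_h - 1), and
    recounts M and m over the elements that remain."""
    n_h = len({d for s, d in zip(stratum, draw) if s == h})
    kept = [not (s == h and d == j) for s, d in zip(stratum, draw)]
    M = Counter(gi for gi, k in zip(g, kept) if k)
    m = Counter(gi for gi, ci, k in zip(g, in_sub, kept) if ci and k)
    total = 0.0
    for wi, yi, gi, ci, s, k in zip(w, y, g, in_sub, stratum, kept):
        if not (ci and k):
            continue
        w_rep = wi * n_h / (n_h - 1) if s == h else wi
        total += M[gi] / m[gi] * w_rep * yi
    return total

# Hypothetical toy data: two first-phase strata with two draws each.
stratum = [0, 0, 0, 0, 1, 1, 1, 1]
draw = [0, 0, 1, 1, 0, 0, 1, 1]
w = [2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0]
y = [1, 0, 1, 1, 0, 1, 1, 1]
g = [0, 1, 0, 1, 0, 1, 0, 1]
in_sub = [True, True, True, False, True, True, False, True]

t3 = double_expansion_total(w, y, g, in_sub)
t3_rep = dee_replicate_recounted(w, y, g, in_sub, stratum, draw, h=0, j=0)
```

Note that the recounted ratio can differ sharply from $M_g/m_g$ when a deleted cluster holds a large share of a small second-phase subsample, which gives some intuition for why this replicate can be unstable.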


5. A MONTE CARLO SIMULATION STUDY 
5.1 Design of the Study 


The results given so far in the text are asymptotic. In 
order to assess the accuracy of the jackknife as a variance 
estimator for the reweighted expansion estimator in a finite 
world, we undertook a Monte Carlo simulation study. At 
the same time, we assessed the accuracy of the two 
jackknife estimators suggested for the double expansion 
estimator in Section 4. 

We used December 1990 Canadian Labour Force Survey 
(LFS) sample data for the province of Newfoundland to 
simulate a finite population, from which repeated samples 
were drawn. The LFS is the largest ongoing household 
sample survey conducted by Statistics Canada. Monthly 
data relating to the labour market is collected using a 
complex multi-stage sampling design with several levels of 
stratification. The details of the design of the survey prior 
to the 1991 redesign can be found in Singh, Drew, 
Gambino and Mayda (1990) and Stukel and Boyer (1992). 
In general, provinces are stratified into “economic regions”, 
which are large areas of similar economic structure; 
Newfoundland has four such economic regions. The 
economic regions are further substratified into lower level 
substrata. The lowest level of stratification in 
Newfoundland yielded 45 strata, each of which contained 
less than 6 clusters or primary sampling units (PSU’s), 
which was an insufficient number from which to sample for 
the purposes of the simulation. Thus, the 45 strata were 
collapsed down to 18, each containing between 6 and 18 
PSU’s. In collapsing the strata, economic regions were kept 
intact, as were the Census Metropolitan Areas of St. John’s
and Corner Brook.

For the Monte Carlo study, R = 4,000 samples were 
drawn from the Newfoundland “population” (which was 
9,152 individuals), according to the following two-phase 
design: within each first-phase stratum, two PSU’s were 
selected at the first phase using simple random sampling 
(SRS) with replacement. This yielded a total of 36 PSU’s. 
All households within selected first-phase PSU’s (as well 
as individuals within those households) were selected, 
resulting in a single-stage take-all cluster sample. At the 
second phase, all selected first-phase elements (individuals, 
treating each person in a PSU selected twice as two separate 
individuals) were restratified according to five age 
categories (≤ 14, 15-24, 25-44, 45-64, ≥ 65), and
second-phase sample elements (i.e., individuals) were 
drawn using SRS without replacement sampling within 
each of the five second-phase strata. 
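
The two-phase design just described can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the code used in the study: the function name, the dictionary layout of the population and the record fields are hypothetical, and the first-phase expansion factor is assumed to be $N_h/n_h$ (number of PSU’s in the stratum divided by the number of draws), which is the usual choice for SRS with replacement.

```python
import random
from collections import defaultdict

def draw_two_phase_sample(psus_by_stratum, stratum2_of, m_target, rng, n_draws=2):
    """Phase 1: n_draws PSUs per stratum by SRS with replacement, taking
    everyone in each drawn PSU. Phase 2: restratify the first-phase
    elements and draw m_target of them per stratum by SRS without replacement."""
    records = []
    for h, psus in psus_by_stratum.items():
        psu_ids = sorted(psus)
        weight = len(psu_ids) / n_draws          # assumed expansion factor N_h / n_h
        for draw in range(n_draws):
            psu = rng.choice(psu_ids)            # with replacement: repeats are allowed
            for person in psus[psu]:
                # a person in a PSU drawn twice appears as two separate records
                records.append({"h": h, "draw": draw, "person": person,
                                "w": weight, "g": stratum2_of[person],
                                "in_sub": False})
    by_g = defaultdict(list)
    for rec in records:
        by_g[rec["g"]].append(rec)
    for recs in by_g.values():
        m_g = min(m_target, len(recs))           # fall back to a census of the stratum
        for rec in rng.sample(recs, m_g):
            rec["in_sub"] = True
    return records

# Hypothetical toy population: two first-phase strata with three PSUs each.
population = {0: {"a": [1, 2, 3], "b": [4, 5], "c": [6, 7, 8]},
              1: {"d": [9, 10], "e": [11, 12, 13], "f": [14]}}
second_phase_stratum = {p: p % 2 for p in range(1, 15)}
sample = draw_two_phase_sample(population, second_phase_stratum,
                               m_target=2, rng=random.Random(1))
```

Each record carries the first-phase stratum, the draw label, the first-phase weight, the second-phase stratum and the subsample flag, which is the information the point and variance estimators need.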
We varied the second-phase stratum sample size to take
on values $m_g = 5$, 10, 20, and 50, yielding overall second-
phase sample sizes of $m = 25$, 50, 100, and 250. When the
number of first-phase-sampled individuals in a second-
phase stratum was less than our target $m_g$ value, we
planned to set $m_g = M_g$, but that event never occurred.

A popular rule of thumb for a “separate ratio estimator” 
such as the reweighted expansion estimator in equation (2) 
is that there should be at least 20 individuals within each 
second-phase stratum (see, for example, Sarndal, Swensson 
and Wretman 1992, p. 270). By allowing $m_g$ to be as small
as 5 and 10, we are checking whether this rule is really 
necessary. 

We considered two parameters of interest: $T_y$, the total
number of employed, and $T_y/T_z$, the employment rate. Here
$T_y = \sum_{i \in P} y_i$, where $y_i = 1$ when individual $i$ is employed;
0 otherwise. Similarly, $T_z = \sum_{i \in P} z_i$, where $z_i = 1$ when
individual $i$ is in the labour force (i.e., either employed or
unemployed); 0 otherwise. For each of the $R = 4,000$
samples, we calculated the reweighted expansion estimator
(REE), $t_2$, given by equation (2), the double expansion
estimator (DEE), $t_3$, given by equation (15), and the full
first-phase expansion estimator (FFPE), $t_1$, given by
equation (1). Although these estimators are defined for
totals (applicable for total number of employed), it is a 
simple matter to extend them to ratios of totals (applicable 
for employment rate). 

For each of the R = 4,000 second-phase samples, we 
calculated the jackknife variance corresponding to the 
reweighted expansion estimator and the double expansion 
estimator, given by equation (8) with f=2 and f=3 
respectively. In the case of the double expansion estimator, 
we attempted both the replicates defined in equations (16) 
and (17), which we will refer to as variant 1 and 2, 
respectively. 

For each of the R = 4,000 first-phase samples, we also 
calculated the jackknife variance corresponding to the full 
first-phase estimator for comparison purposes. This is 
given by equation (8) with f= 1. 

For all of the above estimators and their corresponding 
jackknife variances, a number of frequentist properties were 
investigated. These are given below. For simplicity, they 
are expressed only in terms of estimates of the total number 
of employed. 

The percent relative bias of the estimated number of 
employed with respect to the population value is estimated 
by 


$$\mathrm{PRB}(t^*) = \{[E_{MC}(t^*)/T_y] - 1\} \times 100, \qquad (18)$$

where

$$E_{MC}(t^*) = (1/4{,}000) \sum_{r=1}^{4,000} t^*_r$$

is the Monte Carlo expectation of the point estimator
taken over the 4,000 samples. Here $t^*$ can be $t_1$, $t_2$,
or $t_3$, and $t^*_r$ is the value of $t^*$ for sample $r$.

The percent relative bias of the jackknife variance
estimator with respect to the true mean squared error is
estimated by

$$\mathrm{PRB}[v_J(t^*)] = (\{E_{MC}[v_J(t^*)] - \mathrm{MSE}_{true}\}/\mathrm{MSE}_{true}) \times 100, \qquad (19)$$

where

and $v_{Jr}(t^*)$ is the value of $v_J(t^*)$ for sample $r$.
The (percent) coefficient of variation of the jackknife
variance with respect to the true MSE is estimated by:

$$\mathrm{CV}[v_J(t^*)] = (\{(1/4{,}000) \sum_{r=1}^{4,000} [v_{Jr}(t^*) - \mathrm{MSE}_{true}]^2\}^{1/2}/\mathrm{MSE}_{true}) \times 100; \qquad (20)$$

that is, the estimated root mean squared error of the
variance estimator divided by the estimated true MSE,
expressed as a percentage.
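
The three Monte Carlo summaries can be computed along the following lines. This is a sketch only: the function names and the small input lists are hypothetical, and the "true" mean squared error is assumed here to be the Monte Carlo average of the squared errors of the point estimates, since its defining display is not reproduced in this text.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def percent_relative_bias(estimates, true_value):
    """Percent relative bias of a point estimator over the replications."""
    return (mean(estimates) / true_value - 1.0) * 100.0

def monte_carlo_mse(estimates, true_value):
    """Assumed definition of the 'true' MSE: mean squared error over replications."""
    return mean([(t - true_value) ** 2 for t in estimates])

def percent_relative_bias_of_variance(var_estimates, mse_true):
    """Percent relative bias of the variance estimator against the 'true' MSE."""
    return (mean(var_estimates) - mse_true) / mse_true * 100.0

def percent_cv_of_variance(var_estimates, mse_true):
    """Root mean squared error of the variance estimates, relative to the
    'true' MSE, as a percentage."""
    rmse = sqrt(mean([(v - mse_true) ** 2 for v in var_estimates]))
    return rmse / mse_true * 100.0

# Hypothetical replications (four instead of 4,000, for illustration only).
point_estimates = [9.0, 11.0, 10.0, 10.0]
variance_estimates = [0.4, 0.6, 0.5, 0.5]
T_true = 10.0

prb_point = percent_relative_bias(point_estimates, T_true)
mse_true = monte_carlo_mse(point_estimates, T_true)
prb_var = percent_relative_bias_of_variance(variance_estimates, mse_true)
cv_var = percent_cv_of_variance(variance_estimates, mse_true)
```

Because the last summary is a root mean squared error rather than a standard deviation, a badly biased variance estimator will show a large value even when it is stable from sample to sample.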


5.2 Results of the Study 


Table 1A gives the estimated percent relative biases of 
the three point estimates for the total number of employed 
using equation (18), and Table 1B gives the same for the 
employment rate. All biases are less than 1% in absolute 
value. 


Table 1A
Percent Relative Bias of the Point Estimates
for Total Number of Employed

Estimator   $m_g = M_g$   $m_g = 50$   $m_g = 20$    $m_g = 10$   $m_g = 5$
REE         -             0.14         [illegible]   -0.29        -0.56
DEE         -             0.16         -0.01         0.03         0.115
FFPE        0.04          -            -             -            -

Table 1B
Percent Relative Bias of the Point Estimates
for Employment Rate

Estimator   $m_g = M_g$   $m_g = 50$   $m_g = 20$    $m_g = 10$   $m_g = 5$
REE         -             -0.09        -0.31         -0.19        -0.26
DEE         -             -0.08        -0.27         -0.12        -0.13
FFPE        -0.09         -            -             -            -


REE - Reweighted Expansion Estimator ($t_2$)
DEE - Double Expansion Estimator ($t_3$)
FFPE - Full First Phase Estimator ($t_1$)
Not displayed are the Monte Carlo estimates of the mean
squared errors (i.e., the values of $\mathrm{MSE}_{true}$) and the
corresponding coefficients of variation from using either
the reweighted or double expansion estimator. This is
because the focus in this article is on mean squared error
estimation. The mean squared errors (and coefficients of
variation) from using the two estimators are comparable for
each sample size (a relative difference in the coefficient of
variation is roughly half of the corresponding relative
difference in mean squared error). The reweighted
expansion estimator is slightly more efficient when
estimating the total number of employed individuals (e.g.,
when $m_g = 5$, the double expansion estimator has 17%
more mean squared error). There is less than a 1%
difference in the mean squared errors from using the two 
approaches when estimating the employment rate. Not 
surprisingly, the mean squared errors for all estimators 
increase as the second-phase sample size decreases. 

Table 2A gives the estimated percent relative biases of 
the jackknife variances for the total number of employed 
using equation (19), and Table 2B gives the same for the 
employment rate. Focusing first on Table 2A, the full
first-phase estimator’s variance is almost perfectly unbiased, at
0.94%. The jackknife for the reweighted expansion
estimator works well, having small negative biases in the
variances, always less than 6% in absolute value. The biases tend to become
more negative (although not uniformly) as the second-phase 
sample sizes diminish. 


Table 2A
Percent Relative Bias of Jackknife Variances
for Total Number of Employed

Estimator         $m_g = M_g$   $m_g = 50$   $m_g = 20$   $m_g = 10$    $m_g = 5$
REE               -             -0.99        -2.51        -5.81         -5.13
DEE (Variant 1)   -             46.35        68.24        78.18         86.22
DEE (Variant 2)   -             101.59       278.44       654.99        1997.51
FFPE              0.94          -            -            -             -

Table 2B
Percent Relative Bias of Jackknife Variances
for Employment Rate

Estimator         $m_g = M_g$   $m_g = 50$   $m_g = 20$   $m_g = 10$    $m_g = 5$
REE               -             -3.53        -3.45        -7.09         -6.55
DEE (Variant 1)   -             2.46         1.53         [illegible]   7.41
DEE (Variant 2)   -             0.36         4.91         9.09          30.46
FFPE              2.08          -            -            -             -


REE - Reweighted Expansion Estimator ($t_2$)
DEE - Double Expansion Estimator ($t_3$)
FFPE - Full First Phase Estimator ($t_1$)
Variant 1 uses the jackknife replicates in equation (16)
Variant 2 uses the jackknife replicates in equation (17)


In contrast, both jackknife variants for the double 
expansion estimator fail miserably, with very large positive 
biases in the variances ranging from 46.35% to 1997.51%! 
The second variant is worse than the first, but both are well 
beyond the realm of acceptable behavior. 

Table 2B repeats the analysis for the ratio estimate of 
employment rate. The results here are surprising since all 
variance estimators behave reasonably well, with the 
exception of variant 2 of the double expansion estimator 
when $m_g = 5$. Other than this case where the bias in the
variance is 30.46%, all other biases are less than 10% in 
absolute value. 

Overall, Tables 2A and 2B provide strong support for
using the jackknife variance estimator with a reweighted 
expansion estimator even when second-phase sample sizes 
are surprisingly small. By contrast, the jackknife can fail 
miserably for the double expansion estimator when 
estimating totals. Sometimes, however, variant 1 can also 
work reasonably well depending on the estimator and the 
data. 

Although most studies focus on the bias of the variance 
estimators, it is also of secondary interest to look at the 
coefficient of variation of the variance estimators to see 
how stable the variance estimates themselves are. In Tables 
3A and 3B, we investigate the estimated (percent) 
coefficients of variation corresponding to the total number 
of employed and the employment rate, respectively. In 
equation (20), the expression under the square root in the 
numerator gives the MSE of the variance, whose 
component parts are the square of the bias of the variance 
and the variance of the variance. For those entries in Tables 
2A and 2B where the bias of the variance has been 
determined to be exceedingly large (say larger than 20%), 
the corresponding entries in Tables 3A and 3B are not 
reported (indicated by a *), since it is clear that those entries 
will be excessively large. In Table 3A, the estimated 
coefficients of variation corresponding to the reweighted 
expansion estimator range between 46.86% and 53.42%. 
Coefficients of variation of the magnitude exhibited here 
are typical for variance estimators, and have been 
encountered in other simulation studies relating to 
variances. See, for example, Kovačević and Yung (1997).
To that end, note that even the estimated coefficients of 
variation corresponding to the full first-phase estimators are 
in the same range, and in fact, somewhat higher than those 
of the second-phase estimators in all cases. 

The entries in Table 3B, which gives the coefficients of variation for
the variances of the estimated employment rates, are entry
by entry higher than their counterparts in Table 3A. In
addition, all estimators exhibit the pattern that their 
corresponding coefficients of variation increase, quite 
substantially in fact, as the second-phase sample sizes 
diminish. This effect is more pronounced for the ratio 
estimators than it is for the estimators of the total. The very 
high coefficients of variation in the column m, = 5 for both 
tables is not surprising, since the overall second-phase 
sample size (25) is actually smaller than the number of 
PSU’s drawn in the first phase of sampling (36). In fact, a 
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Table 3A 
Coefficient of Variation of Jackknife Variances 
for Total Number of Employed 


Estimator m,=M, m,=50 m,=20 m,=10 m,=5 
REE - 51.33 49.3 46.86 53.42 
DEE ~ * * * * 

(Variant 1) 
DEE = * * * * 
(Variant 2) 
FFPE 56.71 - - - - 
Table 3B 
Coefficient of Variation of Jackknife Variances 
for Employment Rate 
Estimator m =M, m,=50 m,=20 m,=10 m,=5 
REE - 59.28 65.66 74.26 103.06 
DEE - 59.24 66.16 72.89 99.1 
(Variant 1) 

DEE - 60.94 WP 92.71 “ 
(Variant 2) 

FFPE 78.42 - = - 


REE _ - Reweighted Expansion Estimator (¢,) 

DEE - Double Expansion Estimator (¢,) 

FFPE - Full First Phase Estimator (¢,) 

Variant 1 uses the jackknife replicates in equation (16) 
Variant 2 uses the jackknife replicates in equation (17) 


more relevant realized sample count for the ratio estimator 
is the number of sampled individuals in the labour force 
(i.e., in the denominator). This value varies from sample to 
sample and is often considerably less than 25. 


6. EXTENDING THE REWEIGHTED 
EXPANSION ESTIMATOR 


6.1 The Reweighted Expansion Estimator 


It is not that difficult to develop a linearization variance 
estimator for the reweighted expansion estimator in 
equation (2). Suppose, however, one had a sample design 
with more than two phases or was interested in estimating 
the ratio of two totals. Linearization, although still possible, 
becomes increasingly cumbersome. The jackknife, on the 
other hand, does not. 

It is a simple matter to generalize the results in Section 
3 to p-phase sampling by induction. The $h$ still refer to the 
first-phase strata, but the $g$ now denote the $p$th-phase 
strata; $S_g$ is the set of elements in the $(p-1)$th-phase sample 
from stratum $g$ while $s_g$ is the $p$th-phase subsample from $g$. 
The $w_i$ in equation (2) are replaced with the $a_i$ from (3) 
for the $(p-1)$th-phase estimator. Similarly, the $a_{i(hj)}$ in the 
jackknife are computed using $a_{i(hj)}$ from the $(p-1)$th phase in 
place of the $w_{i(hj)}$. 

It is also a simple matter (left to the reader) to replace the 
stratified cluster sample in the first phase of selection with 
a stratified multi-stage sample. The results in Section 3 
follow as long as the first stage of the multi-stage sample is 
drawn with replacement. 

Finally, it is not difficult to extend the results of 
Section 3 to more complicated estimators. Let $U_2$ be a 
vector of estimators each in the form of $t_2$ from equation 
(2). The mean squared error of any estimator $\theta = g(U_2)$, 
where $g$ is a smooth function, can be estimated with a 
jackknife in a nearly unbiased manner whenever the 
members of $U_2$ can be. This follows the proofs in the 
literature. Rao and Wu (1985), for example, address the 
asymptotic framework where the $n_h$ are all bounded, while 
Wolter (1985; Chapter 4.5) treats the case where the $n_h$ 
grow arbitrarily large. 
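
As a rough illustration of jackknifing a smooth function of weighted totals, the sketch below uses the standard stratified delete-one-cluster jackknife. The replicate weights here are the usual textbook ones (zero for the deleted cluster, $n_h/(n_h - 1)$ times the weight for the rest of stratum $h$); they are an assumption and not necessarily the replicates defined in Section 3. The function name and the toy data are hypothetical.

```python
import numpy as np

def jackknife_variance(strata, clusters, weights, estimator):
    """Stratified delete-one-cluster jackknife (textbook form).

    estimator(w) maps a weight vector to a point estimate, so any
    smooth function g of weighted totals can be plugged in.
    Assumes every stratum contains at least two clusters.
    """
    strata = np.asarray(strata)
    clusters = np.asarray(clusters)
    w = np.asarray(weights, dtype=float)
    full = estimator(w)
    v = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        ids = np.unique(clusters[in_h])
        n_h = len(ids)
        for j in ids:
            w_rep = w.copy()
            drop = in_h & (clusters == j)
            w_rep[drop] = 0.0                          # delete cluster hj
            w_rep[in_h & ~drop] *= n_h / (n_h - 1.0)   # reweight rest of stratum h
            v += (n_h - 1.0) / n_h * (estimator(w_rep) - full) ** 2
    return full, v

# Toy example: a ratio of two totals (employed over labour force).
employed = np.array([1.0, 0.0, 1.0, 1.0, 1.0, 0.0])
in_labour_force = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.0])
ratio = lambda w: np.sum(w * employed) / np.sum(w * in_labour_force)
rate, v_rate = jackknife_variance([0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2],
                                  [10.0, 12.0, 8.0, 10.0, 9.0, 11.0], ratio)
print(rate, v_rate)
```

Passing the estimator as a function keeps the replication loop unchanged whether the target is a total, a ratio, or another smooth function of totals.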


6.2 Regression in the Second Phase 


The estimator $t_2$ can be generalized into the regression 
estimator: 

$$t_{2reg} = \sum_{i \in S} w_i x_i \left(\sum_{i \in s} w_i e_i d_i x_i' x_i\right)^{-1} \sum_{i \in s} w_i e_i d_i x_i' y_i, \qquad (21)$$

where $S$ denotes the original sample, $x_i$ is a row vector, $d_i$ 
is a scalar, and there exists a row vector $\gamma$ such that 
$d_i \gamma x_i' = 1$ for all $i$. In practice, $d_i$ is usually 1 for all $i$. A 
popular exception occurs when $x_i$ is a scalar and $d_i = 1/x_i$. In 
equation (2), $d_i = 1$ for all $i$, and $x_i$ is a $G$-vector with a 
value of 1 in the $g$-th position and 0's elsewhere for $i \in S_g$. 
Let 

$$r_i = y_i - x_i \left(\sum_{k \in s} w_k d_k x_k' x_k\right)^{-1} \sum_{k \in s} w_k d_k x_k' y_k.$$

The replicate $t_{2reg(hj)}$ has the same form as $t_{2reg}$ except 
that $w_{i(hj)}$ replaces $w_i$ everywhere. Similarly, $r_{i(hj)}$ has the 
same form as $r_i$ except that $w_{i(hj)}$ replaces $w_i$. Note that the $e_i$ 
are unchanged from $t_{2reg}$ to $t_{2reg(hj)}$. 

Since the sampling design hasn't changed, most of 
equation (6) stays as is except that now $(\sum_{i \in s_g} w_i r_i)^2$ is 
nonnegative rather than strictly zero. The interested reader 
can verify that equations (10) through (13) remain in their 
present form. It turns out that the jackknife has, if anything, 
an (approximate) upward bias in equation (14). That is to 
say, the jackknife is a conservative estimator of variance. 
Again, see the appendix (equations (A6) through (A9)) for 
a formal statement of the asymptotic assumptions. 

The bias in the jackknife disappears when $\sum_{i \in s_g} w_i r_i = 0$ 
for all $g$. Formally, this will happen when there exist $G$ 
row vectors $\gamma_1, \ldots, \gamma_G$ such that $d_i \gamma_g x_i' = 1$ when $i \in S_g$ 
and 0 otherwise (since $\sum_{i \in s_g} w_i r_i = \sum_{i \in s} d_i \gamma_g x_i' w_i r_i = 
\gamma_g \sum_{i \in s} w_i d_i x_i' r_i = \gamma_g \{\sum_{i \in s} w_i d_i x_i' y_i - \sum_{i \in s} w_i d_i x_i' x_i 
(\sum_{i \in s} w_i d_i x_i' x_i)^{-1} \sum_{i \in s} w_i d_i x_i' y_i\} = 0$). When all $d_i = 1$, the 
existence of $\gamma_g$ means that either one member of $x_i$ is an 
indicator variable equal to 1 when $i \in S_g$ and 0 otherwise, or 
one member of a linear transform of $x_i$ is such an indicator 
variable. 
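
The claim that the regression form contains the reweighted expansion estimator as a special case can be checked numerically. The sketch below assumes the estimator has the form $\sum_{S} w_i x_i (\sum_{s} w_i e_i d_i x_i' x_i)^{-1} \sum_{s} w_i e_i d_i x_i' y_i$, with $d_i = 1$, stratum indicators for $x_i$, and $e_i = M_g/m_g$; the function name and the toy data are hypothetical.

```python
import numpy as np

def t2reg(w_S, x_S, in_s, e, d, y):
    """Two-phase regression estimator (assumed form, see lead-in).

    w_S, x_S : first-phase weights and covariate rows for all of S
    in_s     : boolean mask marking the second-phase subsample s
    e, d, y  : per-unit e_i, d_i, y_i (y is only read where in_s is True)
    """
    wed = w_S[in_s] * e[in_s] * d[in_s]
    xs = x_S[in_s]
    A = (xs * wed[:, None]).T @ xs        # sum over s of w e d x'x
    b = (xs * wed[:, None]).T @ y[in_s]   # sum over s of w e d x'y
    total_x = w_S @ x_S                   # sum over S of w x
    return total_x @ np.linalg.solve(A, b)

# Toy data: two second-phase strata with three first-phase units each.
g = np.array([0, 0, 0, 1, 1, 1])
w = np.array([2.0, 3.0, 1.0, 4.0, 2.0, 2.0])
in_s = np.array([True, True, False, True, False, True])
y = np.array([5.0, 7.0, np.nan, 1.0, np.nan, 3.0])   # y unobserved outside s
x = np.eye(2)[g]                                     # stratum indicator rows
e = np.full(6, 3.0 / 2.0)                            # M_g / m_g = 3/2 in both strata
d = np.ones(6)

# Reweighted expansion estimator computed directly, stratum by stratum.
direct = sum(w[g == k].sum() / w[(g == k) & in_s].sum()
             * np.sum(w[(g == k) & in_s] * y[(g == k) & in_s]) for k in (0, 1))
print(t2reg(w, x, in_s, e, d, y), direct)
```

With indicator covariates the matrix being inverted is diagonal, so the constant $e_i$ within each stratum cancels and the two computations agree.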


7. CONCLUDING REMARKS 


The main purpose of this paper was to show that a 
simple jackknife variance estimator can be nearly unbiased 
for an estimation strategy involving two-phase sampling as 
long as that strategy employs a reweighted expansion 
estimator and not a double expansion estimator. Since the 
theoretical results for the reweighted expansion estimator 
rely on asymptotic arguments, their practical application 
will depend on the context. Nevertheless, a Monte Carlo 
simulation study performed here suggests that the jackknife 
can be an effective estimator for the variance of a 
reweighted expansion estimator even with surprisingly 
small second-phase stratum sample sizes, that is, sizes of 5 
and 10. 


APPENDIX 


The Design Consistency of the Reweighted Expansion 
Estimator 

To establish the design consistency of $t_2$ in equation (2) 
it is sufficient to assume that the sample design and 
population values of the $y_i$ are such that 

$$\sum_{g=1}^{G} (M_g/m_g) \sum_{i \in s_g} w_i y_i \Big/ T - 1 = O_p(1/\sqrt{m}),$$

and, given any first-phase sample, 

$$\left[\sum_{k \in S_g} w_k \Big/ \sum_{k \in s_g} w_k\right] (m_g/M_g) - 1 = O_p(1/\sqrt{m}) \qquad (A1)$$

for all $g$. These assumptions justify equation (5) in the text. 

We assume in our analysis that $G$ is bounded and that 
each $m_g$ has the same asymptotic order as $m$. This is only 
possible when the $S_g$ are determined after the first-phase 
sample has been drawn. Otherwise, the $M_g$ would be 
random variables, and a minimum size for each $m_g$ could 
not be guaranteed for all possible first-phase samples. In 
principle, we are assuming the existence of a mechanism 
for determining the $S_g$ and the second-phase sampling 
fractions given any first-phase sample. By contrast, the 
exact values of $G$ and the $m_g$ can but need not be fixed 
before the first-phase sample is drawn. 


A Comment on the Asymptotic Framework 

Recall that the text showed that the jackknife contains a 
component that estimates the second-phase variance (i.e., 
$E_2[(t_2 - t_1)^2]$) in an asymptotically unbiased manner given 
any first-phase sample (see equation (14)). As a result, that 
component also estimates the average (i.e., unconditional) 
second-phase variance across all possible first-phase 
samples (i.e., $E_1\{E_2[(t_2 - t_1)^2]\}$) in an asymptotically 
unbiased manner. 


In our empirical work, we strayed from the sampling 
framework described above so that the results could be 
easily summarized. In particular, we defined the $S_g$ 
beforehand, and let the $M_g$ be random. When the first- 
phase sample was such that $M_g$ was less than the desired 
$m_g$ (say 50) in some second-phase stratum, we planned to 
choose all the individuals in $S_g$ for the second-phase 
sample. As a result, there would be no contribution to the 
mean squared error (or bias) of $t_2$ from second-phase 
stratum $g$ when that particular first-phase sample was 
selected, and so no asymptotic assumptions about $m_g$ 
would be necessary. As it happened, in no simulation was 
$M_g$ actually less than 50. Nevertheless, a decision rule 
about the second-phase sampling fractions was in place for 
every possible first-phase sample. 


Jackknife Replicates 

There are (at least) two distinct asymptotic frameworks 
for the first-phase sample. In the first, there is an arbitrarily 
large number of first-phase strata each of which is bounded 
in size; that is, each $1/n_h = O(1)$ while $1/H = O(1/m)$. In 
the second, all the first-phase strata are arbitrarily large; 
that is, $1/n_h = O(1/m)$. Under either framework, we assume 
that the number of elements in each cluster is $O(1)$; that is 
to say, bounded. 

Since every $m_g$ is of the same asymptotic order as $m$, it 
is not unreasonable to assume under either regime that, 
given any first-phase sample, 

$$\sum_{i \in S_g} w_{i(hj)} \Big/ \sum_{i \in S_g} w_i - 1 = O_p(1/m), \qquad (A2)$$

and 

$$\sum_{i \in s_g} w_{i(hj)} \Big/ \sum_{i \in s_g} w_i - 1 = O_p(1/m), \qquad (A3)$$

which can be used to establish equation (9). Similarly, we 
assume that given any first-phase sample 

$$\sum_{i \in s_g} w_{i(hj)} y_i \Big/ \sum_{i \in s_g} w_i y_i - 1 = O_p(1/m), \qquad (A4)$$

which assures us that $r_{i(hj)} - r_i = O_p(1/m)$. 


Equations (12), (13), and (14) 

Since the number of elements in each cluster is 
bounded, say by $B$, the third term on the right hand side of 
equation (12) has at most $GB^2$ terms, a bounded number. 
Each of these terms is of order $1/m_g$ (formally, the 
probability that any one term is of asymptotic order greater 
than $1/m_g$ is zero). Consequently, the second line of 
equation (12) is asymptotically ignorable. 

Equation (14) holds when each $1/n_h = O(1)$, because if 
each $n_h$ is less than $C$ (say), then the third term on the right 
hand side of equation (13) will be the sum of at most 
$G(BC)^2$ terms, a bounded number. Each of these terms is 
again of order $1/m_g$. Consequently, the second line of 
equation (13) is asymptotically ignorable. 

Alternatively, suppose each $1/n_h$ were $O(1/m)$. We will 
assume that the sample design and population are such that, 
given any first-phase sample, 

$$A_h = \sum_{i \in F_h} w_i (e_i c_i - 1) r_i \Big/ \sum_{i \in F_h} w_i y_i = O_p(1/\sqrt{m}) \qquad (A5)$$

for all $h$. To see why this is a reasonable assumption, 
observe that conditioned on the first-phase sample, the 
denominator of $A_h$ is a domain total, namely the sum of the $w_i y_i$ 
among the elements in $F_h$. Consequently, it is $O(m)$ 
(without loss of generality we can assume that all the $w_i$ are 
$O(1)$). The numerator of $A_h$ is the difference between an 
expansion estimator (the sum of the $w_i e_i c_i r_i$ in $F_h$) based 
on a stratified simple random sample and its target (the sum 
of the $w_i r_i$ in $F_h$). Equation (A5) makes the modest 
assumption that the sampling design and population are such 
that this difference is $O_p(\sqrt{m})$ for every possible first-phase 
sample. 

Under assumption (A5), $\sum_{i \in F_h} w_i z_i = \sum_{i \in F_h} w_i y_i (1 + A_h)$ 
is approximately equal to $\sum_{i \in F_h} w_i y_i$, which implies 
$E_2[(\sum_{i \in F_h} w_i z_i)^2]/n_h \approx (\sum_{i \in F_h} w_i y_i)^2/n_h$. Equation (14) 
follows from this near equality and from equations (11) and 
(12) (since $n_h$ is large, $n_h/(n_h - 1) \approx 1$). 


Counter-examples to the Jackknifes for the Double 
Expansion Estimator 

As a counter-example to the replicate form in equation 
(16), consider the situation where each cluster contains a 
single element, $H = G = 1$, and all the $y_i$ values are equal 
to 1. As a result, $t_3 = T$, which means that $t_3$ has no 
variance. Unfortunately $t_{(1j)3} = T[n_1/(n_1 - 1)](m - 1)/m$ 
when $j \in s$ and $T n_1/(n_1 - 1)$ otherwise. Thus, 
$(t_{(1j)3} - T)^2/T^2 = O_p(1/m^2)$. Now $v_{J3}/T^2$ computed from the 
$t_{(1j)3}$ would also be $O(1/m)$ since it is the sum of $n_1$ terms 
of order $O(1/m^2)$. 

Although $v_{J3}/T^2$ is $O(1/m)$, $v_{J3}$ is not close enough to 
zero for our purposes. To see why, observe that if the $y_i$ 
were all $N(1,1)$, then the relative variance of $t_3$ would be 
$1/m$, which is also $O(1/m)$. Thus, for $v_{J3}$ to be nearly zero, 
$v_{J3}/T^2$ would have to be smaller than $O(1/m)$. It is not, 
and the jackknife variance estimator is not nearly unbiased. 

As a counter-example to the replicate form in equation 
(17), consider the situation where each cluster is again a 
single element and all $y_i$ values are equal to 1, but now 
$H = m$, $G = 1$, the population size in each $h$ is $N_0$, $n_h = 2$ 
for all $h$, and $M_1 = 2m$. As a result, $T = t_3 = mN_0$, so once again 
$t_3$ has no variance. The replicate $t_{(hj)3}$ can take on 
four possible values. If $hj \in s$ and $hj' \in s$ ($j \neq j'$), then 
$t_{(hj)3} = [(m/2)(2m - 1)/(m - 1)]N_0$. If $hj \in s$ and $hj' \notin s$, 
then $t_{(hj)3} = [(\{m - 1\}/2)(2m - 1)/(m - 1)]N_0$. If $hj \notin s$ and 
$hj' \in s$, then $t_{(hj)3} = [(m/2)(2m - 1)/m]N_0$. If $hj \notin s$ and 
$hj' \notin s$, then $t_{(hj)3} = [(\{m - 1\}/2)(2m - 1)/m]N_0$. In all 
cases, $(t_{(hj)3} - T)^2/T^2 = O_p(1/m^2)$, and so the jackknife 
variance estimator fails to be nearly unbiased. 

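
The first counter-example can be checked with a few lines of arithmetic. The sketch below takes the two replicate values stated there ($T[n_1/(n_1 - 1)](m - 1)/m$ when the deleted unit is in the second-phase sample, $T n_1/(n_1 - 1)$ otherwise) and assumes the jackknife sum carries the usual $(n_1 - 1)/n_1$ factor. Setting $n_1 = 2m$ is an arbitrary choice made only for illustration.

```python
def relative_jackknife_variance(n1, m):
    """v_J / T^2 for the first counter-example (all y_i = 1, H = G = 1).

    Assumes v_J = ((n1 - 1) / n1) * sum over j of (t_(1j) - T)^2.
    """
    T = 1.0
    rep_in_s = T * n1 / (n1 - 1) * (m - 1) / m   # deleted unit is in s
    rep_not_in_s = T * n1 / (n1 - 1)             # deleted unit is not in s
    total = m * (rep_in_s - T) ** 2 + (n1 - m) * (rep_not_in_s - T) ** 2
    return (n1 - 1) / n1 * total / T ** 2

for m in (10, 100, 1000):
    # The true variance of the estimator is zero, yet the jackknife
    # relative variance shrinks only at the rate 1/m.
    print(m, relative_jackknife_variance(2 * m, m), 1.0 / m)
```

Under these assumptions the expression simplifies to $(n_1 - m)/[m(n_1 - 1)]$, which is of order $1/m$ and therefore of the same order as a genuine relative variance.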


The Two-phase Regression Estimator 

To support the arguments in the text about the regression 
estimator in equation (21), we assume the sampling design 
and population values are such that the following 
asymptotic relationships hold. First, 

$$\sum_{k \in S} w_k x_k \left(\sum_{k \in s} w_k e_k d_k x_k' x_k\right)^{-1} d_i x_i' - 1 = O_p(1/\sqrt{m}), \qquad (A6)$$

which is a generalization of equation (A1). Likewise, 
equations (A2) and (A3) generalize to 

$$\sum_{i \in S} w_{i(hj)} d_i q_i \Big/ \sum_{i \in S} w_i d_i q_i - 1 = O_p(1/m), \qquad (A7)$$

and 

$$\sum_{i \in s} w_{i(hj)} e_i d_i q_i \Big/ \sum_{i \in s} w_i e_i d_i q_i - 1 = O_p(1/m) \qquad (A8)$$

for all $q_i$, where $q_i$ is an element of the matrix $x_i' x_i$. 
Finally, the assumption in equation (A4) generalizes to 

$$\sum_{i \in s} w_{i(hj)} d_i p_i \Big/ \sum_{i \in s} w_i d_i p_i - 1 = O_p(1/m) \qquad (A9)$$

for all $p_i$, where $p_i$ is an element of the matrix $x_i' y_i$. 


A Synthetic, Robust and Efficient Method of Making 
Small Area Population Estimates in France 


GEORGES DECAUDIN and JEAN-CLAUDE LABAT¹ 


ABSTRACT 


Since France has no population registers, population censuses are the basis for its socio-demographic information system. 
However, between two censuses, some data must be updated, in particular at a high level of geographic detail, especially 
since censuses are tending, for various reasons, to be less frequent. In 1993, the Institut National de la Statistique et des 
Etudes Economiques (INSEE) set up a team whose objective was to propose a system to substantially improve the existing 
mechanism for making small area population estimates. Its task was twofold: to prepare an efficient and robust synthesis 
of the information available from different administrative sources, and to assemble a sufficient number of “good” sources. 
The “multi-source” system that it designed, which is reported on here, is flexible and reliable, without being overly complex. 


KEY WORDS: Population estimates; Administrative files; Robust estimation. 


1. INTRODUCTION 


In France, as in all countries that do not have population 
registers, censuses of the population are the cornerstone of 
the socio-demographic information system. However, 
censuses are quite massive operations that cannot at present 
be carried out more often than once every seven or eight 
years. In the interval between censuses, it is therefore 
necessary to update some information, especially at a high 
level of geographic detail, particularly since for various 
reasons, censuses are tending to be less frequent. Thus, 
small area population estimates are a major challenge for 
the Institut National de la Statistique et des Etudes 
Economiques (INSEE). 

Despite the progress achieved in this field, the situation 
in 1993 still seemed fairly unsatisfactory. When figures 
from the 1990 population census were compared to the 
population estimates made on the basis of the previous 
census (1982) for the metropolitan departments, the 
differences noted were sometimes sizable. 

INSEE therefore created a methodology team whose 
mission was to propose a system that would substantially 
improve the existing mechanism. Initially, the next census 
was to take place in 1997. It therefore seemed reasonable to 
have the new system operate on an experimental basis until 
the census, so as to see how well it worked before using it 
in actual production. When the census was postponed to 
1999, it became more necessary to bring the project to a 
successful conclusion quickly, so as to be able to use the 
new system in 1996. 

To achieve its objective, the team devoted itself, with 
maximum pragmatism, to a twofold task: to develop an 
efficient and robust synthesis of the information available 
from different administrative sources, and to assemble a 
sufficient number of “good” sources. The “multi-source” 
system that it designed, which is described here, is not 
overly complex and seems effective. A more detailed 
description of it is provided in Decaudin and Labat (1996). 


2. MAIN CONCLUSIONS 


The team’s main conclusions are as follows: 

1) It is impossible to improve total population estimates 
using sample surveys, unless the survey is conducted 
on such a scale that it would be similar to a census. 

2) No single administrative source adequately reflects 
changes in the population. At the local level, all 
sources can exhibit drift, breaks, jolts, etc., which are 
not always easy to detect. Furthermore, even at the 
local level, it is often quite difficult if not impossible to 
get the agency responsible to provide explanatory 
details, much less corrections in the case of errors. In 
any event, it is unwise to rely on a single administrative 
source, however good it may be, since its permanency 
is never guaranteed. 

3) On the other hand, total population estimates can be 
improved substantially by simultaneously using several 
sources. A “multi-source” system, similar to the one 
presented here but more rudimentary, was tested 
retrospectively over the intercensal period 1982-1990, 
for the 96 metropolitan departments. The mean error 
(mean deviation as an absolute value from the results of 
the March 1990 census) fell below 0.9%, whereas the 
mean error registered at the time, with the estimation 
system then in place, was 1.4%. 


3. SIMULTANEOUS USE OF SEVERAL 
SOURCES 


For using several sources jointly, different methods are 
possible. 

A method that is universal — and easy to implement — is 
multiple regression. In simplified form, this amounts to 
using, for any area z, the following relationship: 


$$\frac{P(n+1, z)}{P(n, z)} = c + \sum_S k_S \cdot \frac{N_S(n+1, z)}{N_S(n, z)},$$


¹ Georges Decaudin and Jean-Claude Labat, Institut National de la Statistique et des Etudes Economiques, 18, Blvd. Adolphe-Pinard, 75765 Paris, CEDEX 14. 

where P(n, z) is the population of area z on January 1 of 
year n, the values N_S(n, z) are the numbers from each 
source S on the same date and k_S are coefficients, which are 
estimated by multiple regression over a past period. Here c 
is a constant term that is used only in the regression, with 
calibration on the national population serving to correct any 
drift. 
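
A minimal sketch of how the coefficients of such a relationship could be estimated by ordinary least squares is given below. It assumes the relationship has the form "population ratio = c + sum over sources of k_S times the source ratio"; the function name and the synthetic data are hypothetical, and the calibration on the national population mentioned above is not shown.

```python
import numpy as np

def fit_ratio_regression(pop_ratio, source_ratios):
    """Least-squares fit of pop_ratio = c + sum_S k_S * source_ratio_S.

    pop_ratio     : P(n+1, z) / P(n, z) for each area z over a past period
    source_ratios : one row per area, one column per source S, holding
                    N_S(n+1, z) / N_S(n, z)
    """
    pop_ratio = np.asarray(pop_ratio, dtype=float)
    source_ratios = np.asarray(source_ratios, dtype=float)
    X = np.column_stack([np.ones(len(pop_ratio)), source_ratios])
    coef, *_ = np.linalg.lstsq(X, pop_ratio, rcond=None)
    return coef[0], coef[1:]          # constant c, coefficients k_S

# Synthetic past-period data for five areas and two sources.
ratios = np.array([[1.00, 1.01], [1.02, 1.00], [0.98, 0.99],
                   [1.05, 1.03], [1.01, 1.04]])
observed = 0.1 + ratios @ np.array([0.5, 0.4])
c, k = fit_ratio_regression(observed, ratios)
print(c, k)
```

Because every area enters the fit with the same weight, an anomalous source value distorts the coefficients just as much as a sound one, which is the kind of drawback discussed next.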

This method is used in various countries, including 
Canada and the United States (for example, see Statistics 
Canada 1987 and Long 1993). Nevertheless, it was not 
adopted because it has numerous drawbacks: 


— it must be possible to estimate the coefficients, which 
requires data from each source extending back over a 
fairly long period; 

— the coefficients can change over time, without it being 
possible to control this change; 

— as noted above, the administrative sources are, for 
various reasons (changes in regulations, abrupt shifts in 
management, errors, etc.), subject to what might be 
called “anomalies”. For each source S, the scope of 
these anomalies is reflected in part in the coefficient k_S, 
to an extent that depends on how great their medium- 
term effect has been over the calibration period [la 
période d’étalonnage]; but anomalies nevertheless occur 
in estimates with the same weight as the “good” data 
from the same source. The estimates are then highly 
distorted. 


Another method is known as the “composite” method. 
Each source is used to estimate the population in one or 
more age classes: age class X, which is well-covered by the 
source, but also sometimes another class that definitely 
exhibits a pattern very similar to that of class X (for 
example, the “30-45” age group, if X represents the “under 
18” age group). It is then necessary to have appropriate 
indicators for the other components of the population and 
correctly manage the consolidation of these estimates “in 
parts”. 

This type of method, used in the United States (Long 
1993), seemed to us to be problematic, especially because 
of the difficulty of adequately dealing with “anomalies”. 

The proposed “multi-source” system is based on a robust 
synthesis of estimates from different sources. It combines 
demographic reasoning with purely statistical techniques. 
It draws on the experiments conducted by the INSEE’s 
regional directorate in Brittany in the early 1970s (Laurent 
and Guéguen 1971; Guéguen 1972). Should one of the 
sources fail, such a system is not prevented from 
functioning, even though its performance may be somewhat 
diminished. 


4. DEMOGRAPHIC BASE 


The demographic reasoning which is at the base of the 
system is elementary: assuming that we know the total 
population P(n) for an area on January 1 of year n, the 
population P(n + 1) of the area on January 1 of year n + 1 
is deduced by adding the two components of the change 
during year n: natural increase (births N(n) minus deaths 
D(n)), and net migration (immigrants I(n) minus emigrants 
E(n)). 


P(n + 1) = P(n) + N(n) - D(n) + I(n) - E(n). 


In France, natural increase data are provided annually at 
the commune level by vital statistics. If the latter are not yet 
available in final form, which is often the case in the third 
quarter of year n + 1, it is easy to estimate them with a low 
margin of uncertainty. 

The only unknown, then, is net migration for year n: 
SM(n) = I(n) - E(n) or, what amounts to the same thing, the 
net migration rate T(n) = SM(n)/P(n). In other words, 
estimating the population comes down to estimating net 
migration since the last date on which the population is 
known (or is assumed to be known), and vice versa. 
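
The accounting identity and the equivalence between population and net migration can be written out as a small sketch. The function names and the figures are hypothetical and serve only to illustrate the arithmetic.

```python
def next_population(p_n, births, deaths, net_migration):
    """P(n + 1) = P(n) + births - deaths + net migration during year n."""
    return p_n + births - deaths + net_migration

def net_migration_rate(p_n, p_next, births, deaths):
    """Net migration rate SM(n) / P(n) implied by two population figures."""
    sm = p_next - p_n - (births - deaths)   # what natural increase leaves unexplained
    return sm / p_n

# Hypothetical department: 500,000 inhabitants, 6,000 births, 5,000 deaths.
rate = net_migration_rate(500_000, 503_000, 6_000, 5_000)
print(rate)
print(next_population(500_000, 6_000, 5_000, rate * 500_000))
```

Going from a rate to a population and back uses the same identity, which is why estimates from sources of different kinds can all be compared on the scale of the net migration rate.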

In France, net migration figures are of some importance, 
although less so than in other countries such as Canada or 
the United States. In addition, they generally exhibit a 
certain inertia, at least at relatively aggregated geographic 
levels. One way to assess the influence of changes to them 
from one intercensal period to the next is to measure the 
errors that would have been committed during each period 
if the population had been estimated by using the average 
annual net migration rates for the preceding period. Over 
the period 1982-1990, for the departments (excluding 
Corsica), the mean end-of-period error (in 1990, at the end 
of eight years) would have been only 1.3%. It was not 
certain, when the team started its work, that much greater 
accuracy could be achieved. However, both in 1975 and in 
1982, the mean error that would have been committed with 
the trend method would have been much greater: 2.8% and 
2.7% respectively (over seven years). It would therefore 
seem that the period 1982-1990 was exceptional and that in 
the future the difference will again be more pronounced. 


5. ESTIMATES FROM THE 
DIFFERENT SOURCES 


From each source, using an appropriate method, we draw 
an estimate of annual net migration rate for the population 
as a whole. The methods that may be used depend on the 
data available. 

For each of the sources tested and found to be “good”, at 
least at the departmental level, a method is proposed. The 
five sources retained are the following: housing tax; 
electrical utility customers; children receiving family 
allowances; educational statistics; electoral file. 

The data on the composition of households for tax 
purposes, which appear in the income tax files, are a sixth 
source that should provide very good results. However, to 
date, these data have been analysed for only a few 
departments, and the methodology for using them is not yet 
completely defined. 

We also propose to integrate a trend estimate of the net 
migration rate into the system. 

Two categories of methods are used. The first concerns 
the sources relating to households; the second concerns 
those relating to individuals. 


5.1 Sources Relating to Households 


Some sources provide information on changes in the 
number of households. This is the case with the files on 
housing taxes (HT) and electrical utility customers (EUC). 
The housing tax is one of the four main local direct taxes. 
As its name indicates, it applies to occupied dwellings, with 
main residences and secondary residences being treated 
separately. The housing tax file takes account of the 
situation on January 1 of the taxation year. Starting in the 
1980s, the HT source was the basis for the departmental 
population estimates developed by INSEE (Descours 1992). 
In the early 1990s, it was replaced by the EUC source, in 
light of the distortions caused by a change to the HT 
management system which gradually worked its way 
through all departments. 

The method adopted for using these sources follows 
classical principles. It leads directly to an estimate of the 
total population, and it involves three main stages: 

1) estimating the number of households; 

2) estimating average household size and from there, 
estimating the population of households; 

3) adding the “non-household” population. 

In the first stage, it is assumed that the number of 
households changes in accordance with the data supplied by 
the source (number of main residences for HT purposes or 
number of electrical utility customers). The second stage is 
more delicate. It is based on both the use of statistics on 
dependants from the HT files and on a trend estimate of 
average household size. 

In the proposed “multi-source” system, we move on to 
the net migration rate, for comparison with other sources, 
using vital statistics data (cf. Section 4). 
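
The three stages can be sketched as follows. This is a simplified illustration: the average household size is taken as a given input, whereas the text describes its estimation as the delicate step, and all names and figures are hypothetical.

```python
def household_based_population(households_base, source_base, source_now,
                               avg_household_size, non_household_pop):
    """Population estimate from a household-type source (HT or EUC).

    Stage 1: households are assumed to change like the source counts.
    Stage 2: household population = households * average household size.
    Stage 3: the non-household population is added.
    """
    households_now = households_base * source_now / source_base
    household_pop = households_now * avg_household_size
    return household_pop + non_household_pop

# Hypothetical area: 200,000 households at the base date, source counts
# rising from 210,000 to 214,200 (that is, by 2%).
print(household_based_population(200_000, 210_000, 214_200, 2.5, 10_000))
```

The resulting population can then be turned into a net migration rate through the identity of Section 4, so that it is comparable with the estimates from the other sources.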


5.2 Sources Relating to Individuals 


The other sources used concern individuals. Only a 
certain age group X of the population is generally covered 
adequately. The method then involves two main stages: 

1) estimating, from the source, the net migration rate for 
the population aged X; 

2) from there, estimating the net migration rate for the 
population as a whole. 

The second stage is based on the following statistical 
relationship, observed in the past, between the change, from 
one period to another, of the overall net migration rate (T) 
and the change in the net migration rate for the population 
aged X (TX): 


T(p) - T(p - 1) = δ_X · (TX(p) - TX(p - 1)), 


where p - 1 and p denote two successive periods and δ_X is 
a coefficient close to 1, depending on the age group X. This 
relationship is similar to the one used by 
de Guibert-Lantoine (1987) to estimate the population on 
the basis of educational statistics. 

For the corresponding age groups in the different sources 
used, the values, estimated by linear regression, of the 
coefficient δ_X (+/-2 standard deviations) are shown in 
tables 1 and 2. 


Table 1 
Estimates of δ_X on Departments, Excluding Corsica, 
Internal Net Migration 

                                  Age at end of period 
                          [illegible]      [illegible]      35 and over 
1962-1968 | 1968-1975     0.76 (+/-0.04)   0.69 (+/-0.06)   1.24 (+/-0.09) 
1968-1975 | 1975-1982     0.77 (+/-0.03)   0.88 (+/-0.06)   1.56 (+/-0.08) 
1975-1982 | 1982-1990     0.70 (+/-0.11)   0.49 (+/-0.10)   1.26 (+/-0.17) 


Table 2 
Estimates of δ_X Over the Two Periods 1975-1982 and 
1982-1990, Excluding Corsica, Total Net Migration 

                                  Age at end of period 
                          0-18             9-15             35 and over 
Departments               0.65 (+/-0.11)   0.57 (+/-0.10)   1.22 (+/-0.16) 
Department - 
employment zone           0.65 (+/-0.04)   0.59 (+/-0.04)   1.17 (+/-0.06) 
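
The second stage can be sketched as a one-line update. The sketch assumes the relationship is purely proportional (change in the overall rate equals δ_X times the change in the age-group rate, with no constant term); the function name and the rates are hypothetical, and the coefficient 0.65 is simply the Table 2 departmental value for the first age group.

```python
def overall_rate_from_age_group(t_prev, tx_prev, tx_now, delta_x):
    """Overall net migration rate for the current period.

    Assumes: T_now - T_prev = delta_x * (TX_now - TX_prev).
    """
    return t_prev + delta_x * (tx_now - tx_prev)

# Hypothetical rates: the age-group rate rises from 0.3% to 0.5%.
print(overall_rate_from_age_group(t_prev=0.002, tx_prev=0.003,
                                  tx_now=0.005, delta_x=0.65))
```

Only the change in the age-group rate is carried over, so a source that covers one age group well can still inform the rate for the whole population.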


The approach followed in the first stage depends on the 
source: 


Electoral File 


Annual migration figures for voters in the selected age 
group (30 and over) are supplied directly by the electoral 
file managed by INSEE. We go from the rate of net 
migration of voters to the residential net migration rate by 
dividing the former by a coefficient reflecting the 
magnitude of the change in the electoral file. 


Educational Statistics 


The net migration figure for those in the 5-9 age group 
is obtained by subtracting their number in year n from that 
of the same cohorts the next year (that is, from those in the 
6-10 age group in year n + 1) and correcting for deaths. 
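
A minimal sketch of this cohort comparison follows. It reads the correction for deaths as adding back the deaths that occurred in the cohort during the year, since they reduce the count without being migration; that reading, the function names, and the figures are assumptions.

```python
def cohort_net_migration(count_5_9_year_n, count_6_10_next_year, deaths):
    """Net migration of the cohort aged 5-9 in year n.

    Assumption: deaths in the cohort are added back so that the
    remaining change in the count is attributed to migration.
    """
    return count_6_10_next_year - count_5_9_year_n + deaths

def cohort_net_migration_rate(count_5_9_year_n, count_6_10_next_year, deaths):
    """Net migration of the cohort relative to its size in year n."""
    sm = cohort_net_migration(count_5_9_year_n, count_6_10_next_year, deaths)
    return sm / count_5_9_year_n

# Hypothetical school counts: 30,000 pupils aged 5-9, then 30,240 aged 6-10.
print(cohort_net_migration(30_000, 30_240, 10))
print(cohort_net_migration_rate(30_000, 30_240, 10))
```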


Children Receiving Family Allowances 


The number of persons in the 0-17 age group is 
estimated on the assumption that it evolves similarly to the 
number of children receiving family allowances. From this 
a figure for the net migration of young persons is obtained 
by comparing this estimate to a hypothetical change in the 
youth population without migration, that is, a change due 
solely to natural increase. 


6. SYNTHESIS 


6.1 Principles 


The different basic estimates of the annual net migration 
rate are treated statistically in order to obtain a “synthetic 
rate”, to be used as the final estimate. The treatment serves 
to eliminate outliers, downweight suspect values and, more 
generally, assign to each source a weight that reflects its 
performance. 

More specifically, since each source can “drift”, the 
different basic estimates are generally biased; they are first 
corrected for the national bias of the corresponding source 
for the year considered, a bias that is estimated in advance. 
In proceeding in this way, we implicitly assume that the 
difference between the local bias and the national bias is 
minor in relation to the irreducible unexplained portion of 
the difference (flou irréductible). Once we have estimates 
for a number of years, it should be possible to test this 
hypothesis and if necessary, replace it with one that 
corresponds more closely to reality, so as to improve the 
correction of biases at the local level. 

It should be noted that such a seemingly simple operation 
as correcting the national bias nevertheless requires several 
precautions. The solution that consists in carrying out a 
gross calibration on the national net migration rate, 
considered by definition as a good reference, is not very 
satisfactory, owing to anomalies that may distort the 
calibration. It is therefore preferable to estimate the biases 
by means of a process in which we also eliminate 
anomalies. The process is similar to the one used for 
synthesis, which is described below. However, the deter- 
mination of biases, assumed to be national in scope and 
therefore calculated for 96 departments, is less sensitive to 
anomalies than the determination of synthetic rates, 
calculated over a small number of sources. Only major 
anomalies are likely to significantly throw off the cali- 
bration of the rates and must therefore be corrected. 

The “synthetic” net migration rate is a weighted mean of 
the basic estimates thus calibrated. Each source S is 
assigned an initial weight W_S that is supposed to reflect its 
medium-term accuracy. But in addition, for a given year 
and area, this weight is modulated to take account of the 
plausibility of the corresponding rate. Thus, if a rate is 
“abnormally distant” from the rates obtained from other 
sources (in practice, from a central value for all rates for 
the area), its weight is cancelled or reduced. For this, we 
look at the distance between the rate obtained from each 
source and the central value identified, and we compare it 
to a “norm” of distance NO_S specific to the source, 
determined empirically on the basis of the data available: if the 
distance is less than “a times the norm”, the weight is not 
automatically changed; if it is greater than “b times the 
norm”, it is set at 0; between the two, the weight is 
multiplied by a coefficient between 0 and 1, 
calculated by interpolation. 
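
The modulation rule can be sketched as a small function. The interpolation is assumed to be linear, and the thresholds a = 2 and b = 4 used in the example are hypothetical, since the text does not give their values.

```python
def weight_multiplier(distance, norm, a, b):
    """Coefficient applied to a source's initial weight.

    distance : gap between the source's rate and the central value
    norm     : the source-specific norm of distance
    a, b     : thresholds, with 0 < a < b
    """
    r = abs(distance) / norm
    if r <= a:
        return 1.0                 # plausible rate: weight kept as is
    if r >= b:
        return 0.0                 # implausible rate: weight cancelled
    return (b - r) / (b - a)       # in between: linear interpolation (assumed)

for d in (0.001, 0.003, 0.005):
    print(d, weight_multiplier(d, norm=0.001, a=2.0, b=4.0))
```

The modulated weight of a source is then its initial weight times this coefficient, computed separately for each year and area.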

Note that the trend estimate is formally treated like those 
from exogenous sources; its weight is cancelled when it is 
considered as implausible because it is too far from the 
other estimates. 

The synthesis is achieved automatically, which ensures 
homogeneity and an explicit logic to the treatments carried 
out. This does not, however, eliminate the need to control 
the results obtained. 


6.2 Theoretical Presentation 


On the theoretical level, we sought to use robust reasoning
and estimation techniques, such as those described in Hoaglin,
Mosteller and Tukey (1983). The method adopted falls
within the framework of M-estimators of central tendency,
and more specifically in the category of W-estimators,
which use the reweighted least squares algorithm.

Since the net migration rates for year n and area z
obtained from the different sources S (and corrected for their
national biases) are denoted TC_S(n, z), the synthetic rate
T(n, z) solves the implicit equation:

\[
\sum_{S} W_S \, NO_S \, \psi\!\left( \frac{TC_S(n,z) - T(n,z)}{NO_S} \right) = 0,
\]

where the function \psi is of the type that redescends to a
finite rejection point:

\[
\psi(r) =
\begin{cases}
r, & |r| < a, \\[1ex]
r \, \dfrac{b - |r|}{b - a}, & a \le |r| < b, \\[1ex]
0, & \text{otherwise.}
\end{cases}
\]


Using an iterative process, we can gradually refine the 
automatic processing of suspect data. 
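
The link between this implicit equation and the weight modulation described in section 6.1 can be made explicit with a short derivation. It is a sketch that assumes \psi(r) = r w(r), where w is notation introduced here for the modulation coefficient (1 below a, 0 above b, interpolated in between), and r_S is shorthand for the standardized distance of source S.

```latex
\[
r_S = \frac{TC_S(n,z) - T(n,z)}{NO_S},
\qquad
w(r) =
\begin{cases}
1, & |r| < a, \\
\dfrac{b - |r|}{b - a}, & a \le |r| < b, \\
0, & |r| \ge b,
\end{cases}
\]
\[
\sum_{S} W_S \, NO_S \, \psi(r_S)
= \sum_{S} W_S \, w(r_S) \, \bigl( TC_S(n,z) - T(n,z) \bigr) = 0
\;\Longrightarrow\;
T(n,z) = \frac{\sum_{S} W_S \, w(r_S) \, TC_S(n,z)}{\sum_{S} W_S \, w(r_S)} .
\]
```

Because w(r_S) itself depends on T(n, z), the right-hand side is not a closed form: it has to be recomputed from the current value of T(n, z) until the rates stabilize, which is what the iterations of sections 6.3 and 6.4 do.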


6.3 First Analysis of the Distances From Each Rate 
to the Central Value for the Rates 


1) For each area z we calculate a first central value of the
“calibrated” rates TC_S(n, z). The central value used
must not be overly sensitive to the possible existence of
quite distant values for some sources, but at the same
time it must be influenced more by a source that is on
average more accurate. Under these conditions, rather
than choosing the median (which would meet the first
condition), we use a rank statistic that is a little more
elaborate but nevertheless simple, owing to the small
number of values. This statistic, denoted T1(n, z), is the
mean, weighted by 1/2, 1/4 and 1/4 respectively, of the
three quartiles:

- the median of the rates TC_S(n, z) weighted by the
initial weights W_S,

- the lower quartile (Q1) of the weighted rates,

- the upper quartile (Q3) of the weighted rates.


2) The rates T1(n, z) thus obtained are calibrated on the net
migration rate for the higher level, by simple translation:

\[
TC1(n,z) = T1(n,z) + \left[ TREF(n) - \frac{\sum_{z'} T1(n,z') \, P(n,z')}{\sum_{z'} P(n,z')} \right],
\]

where P(n, z) is the population of area z on January 1 of
year n and TREF(n) is the net migration rate for the
higher level (the national rate for the departmental
synthesis).

3) For each area, we calculate the differences between each
rate and this calibrated central value:

\[
EC1_S(n,z) = \left| TC_S(n,z) - TC1(n,z) \right| .
\]


4) For each source and each area, the size of this difference
is assessed in relation to the “norm” of distance NO_S
specific to the source. This “norm” is determined
empirically on the basis of the available data:
theoretically it is the average of the distances observed
in the past, excluding anomalies. The result is a first
modulation of the weight originally assigned to this
source:

- if EC1_S(n, z) < a1 NO_S, where a1 is a parameter to
be chosen (in the vicinity of 2), we do not change
W_S, the initial weight for S. In other words, if
WM1_S(n, z) is the modulation coefficient of W_S
(a coefficient between 0 and 1), we take
WM1_S(n, z) = 1;

- if EC1_S(n, z) > b1 NO_S, where b1 is another
parameter (in the vicinity of 3), we set the weight at 0,
meaning that we eliminate source S: WM1_S(n, z) = 0;

- if a1 NO_S ≤ EC1_S(n, z) ≤ b1 NO_S, we interpolate
WM1_S(n, z) as a function of the value of EC1_S(n, z):

\[
WM1_S(n,z) = \frac{b1 \, NO_S - EC1_S(n,z)}{(b1 - a1) \, NO_S} .
\]

5) At the end of this first phase, we therefore have new
weights specific to each source and each area, which
allow us to locally eliminate or underweight
suspect rates: W1_S(n, z) = W_S WM1_S(n, z).
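
The first phase can be sketched for a single area as follows. This is a minimal Python illustration under stated assumptions, not the project team's implementation: the weighted-quantile convention (smallest value whose cumulative weight share reaches q), the function names, and any rates or norms passed in are hypothetical. The calibration by translation of step 2 is left out because it needs all areas at once.

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (assumed convention)."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    share = np.cumsum(w) / w.sum()
    idx = np.searchsorted(share, q, side="left")
    return v[min(idx, len(v) - 1)]

def central_value(rates, weights):
    """Step 1: mean of the weighted quartiles with coefficients 1/2, 1/4, 1/4."""
    q1 = weighted_quantile(rates, weights, 0.25)
    med = weighted_quantile(rates, weights, 0.50)
    q3 = weighted_quantile(rates, weights, 0.75)
    return 0.5 * med + 0.25 * q1 + 0.25 * q3

def modulation(distance, norm, a, b):
    """Step 4: 1 below a*norm, 0 above b*norm, linear interpolation in between."""
    coef = (b * norm - distance) / ((b - a) * norm)
    return np.clip(coef, 0.0, 1.0)

def first_phase_weights(rates, weights, norms, a=2.5, b=3.5):
    """Steps 3 to 5 for one area: distances, modulation, modified weights."""
    rates = np.asarray(rates, dtype=float)
    dist = np.abs(rates - central_value(rates, weights))
    return np.asarray(weights, dtype=float) * modulation(dist, np.asarray(norms, dtype=float), a, b)
```

Clipping the interpolation formula to [0, 1] reproduces the three cases of step 4 in one expression, since the formula exceeds 1 below a1 NO_S and is negative above b1 NO_S.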


6.4 Iterations 


1) Using the weights thus modified, W1_S(n, z), we estimate
a new central value for each area, this time taking the
weighted average of the rates:

\[
T2(n,z) = \frac{\sum_{S} TC_S(n,z) \, W1_S(n,z)}{\sum_{S} W1_S(n,z)} .
\]

2) We calibrate each rate T2(n, z) on the net migration rate
for the higher level, by translation. We obtain
TC2(n, z).

3) We calculate, in each area, the differences between each
rate and the calibrated average rate: EC2_S(n, z) =
| TC_S(n, z) - TC2(n, z) |. Using these differences, we
calculate new modulation coefficients for the initial
weights, using the parameters a2 and b2, which may be
different from a1 and b1 (theoretically they would be
lower). We thus obtain new weights W2_S(n, z), which
take account of anomalies more effectively, since the
latter are assessed in relation to a better central
tendency. With these weights, we estimate a new
synthetic rate T3(n, z), which is calibrated on the higher
level to obtain TC3(n, z).

4) The operations described in point 3 are repeated with the 
same parameters a2 and b2. The tests conducted at the 
departmental level over the period 1982-1990 show that 
the convergence is generally rapid; the rates are quite 
often stabilized by the fourth iteration. 
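
The whole loop (central value, calibration by translation, distances, modulated weights, new weighted average) might be sketched as below. It is an illustration only: the function names and array layout are assumptions, the starting central value is a plain median rather than the weighted quartile mean of section 6.3, missing sources are not handled, and an area whose weights are all cancelled would cause a division by zero.

```python
import numpy as np

def calibrate(t, pop, tref):
    """Shift all area rates by one constant so their population-weighted mean equals tref."""
    return t + (tref - np.sum(t * pop) / np.sum(pop))

def modulation(dist, norm, a, b):
    """1 below a*norm, 0 above b*norm, linear interpolation in between."""
    return np.clip((b * norm - dist) / ((b - a) * norm), 0.0, 1.0)

def synthesize(tc, w0, norms, pop, tref,
               params=((2.5, 3.5), (2.0, 3.0), (2.0, 3.0))):
    """tc has shape (areas, sources); one (a, b) pair is used per iteration."""
    tc = np.asarray(tc, dtype=float)
    w0 = np.asarray(w0, dtype=float)
    norms = np.asarray(norms, dtype=float)
    pop = np.asarray(pop, dtype=float)
    central = np.median(tc, axis=1)            # simplified starting value (assumption)
    weights = np.broadcast_to(w0, tc.shape)
    for a, b in params:
        tc_central = calibrate(central, pop, tref)       # TC1, TC2, ...
        dist = np.abs(tc - tc_central[:, None])          # EC1, EC2, ...
        weights = w0 * modulation(dist, norms, a, b)     # W1, W2, ...
        central = np.sum(tc * weights, axis=1) / np.sum(weights, axis=1)
    return calibrate(central, pop, tref), weights
```

With the default `params`, the returned rates correspond to a calibrated synthetic rate after three iterations, and the returned weights show which sources were cancelled or underweighted in each area.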


7. IMPLEMENTATION AT THE 
DEPARTMENTAL LEVEL 


The estimation system outlined above, which is 
operationalized for 1990 and subsequent years, was 
implemented by the project team for the year 1990 at the 
departmental level, with the following five sources: housing 
tax (HT), electrical utility customers (EUC), family 
allowances (FA), educational statistics (ES), electoral file 
(EF), plus the trend estimate (TREND). 

Figure 1 shows the results obtained for several 
departments. Table 3 shows the values of the weights and 
norms used to make the system operate. This table also 
shows certain statistics obtained from the synthesis of the 
net migration rates; in particular they concern the 
differences between the rates obtained from each source 
and the synthetic rates. 


Table 3

Implementation for Year 1990 at Department Level
Parameters and Statistics

                              HT           EUC          FA           ES           EF           TREND
Weight                        115          100          80           70           80           100
Norm                          [illegible]  0.17         [illegible]  [illegible]  0.19         0.12
Number of rates               96           96           89           96           94           96
Average distance              0.55         0.14         0.30         0.19         0.14         0.13
Number of “aberrant” rates    [illegible]  2            [illegible]  3            1            6
Average of distances
without “aberrant” rates      [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]

Notes: - Coefficients (a; b) applied to norms: (2.5; 3.5) in the first
         iteration, then (2; 3).
       - The values of the distances and norms correspond to rates
         expressed as a %.
       - Distances are calculated in relation to the synthetic rates
         after three iterations.
       - “Aberrant” rates are those for which the weight is cancelled
         after three iterations.

The results suggest that the system is even more effective 
than indicated by the summary retrospective test carried out 
on the 1982-1990 intercensal period with the same sources. 
Aside from the HT source, which is still distorted, the 
estimates from the different sources are more convergent 
than they were on average in the retrospective test (see 
Table 4). 

There is nothing surprising about this, given the
rudimentary state of the system tested on the 1982-1990
intercensal period. The data used were rough or even
fragmentary, owing to the difficulty of assembling, in 1993,
management data for years past (1982, ...); in addition, the
relationships used to draw an estimate of the net migration
rate from each source were simplistic; and lastly, the
method of synthesis was less elaborate.

Figure 1: Summary of Net Migration Rates for 1990 for Twelve
Departments, Identified by Number (49, 62, etc.). Vertical axis: net
migration rate (%); horizontal axis: department. Legend entries
include electrical utility customers, family allowances, educational
statistics, electoral file and synthetic rate (TC4).

Note: TC4 is the synthetic rate obtained after three iterations. Where
the weight for a source has been eliminated or reduced, the value of
the modulation coefficient (WM3) is shown.

It should be noted that the integration of other sources — 
income tax data in particular — can only further reinforce the 
effectiveness of the system. 


Table 4
Mean of Distance in Retrospective Test

               HT     EUC          FA           ES           EF
1982           0.26   [illegible]  0.50         0.47         0.34
1983           0.28   0.33         0.48         0.47         0.32
1984           0.23   0.28         0.40         0.45         0.34
1985           0.24   [illegible]  0.48         0.44         0.32
1986           0.23   0.33         0.40         0.32         -
1987           0.40   0.28         [illegible]  0.27         -
1988           0.84   [illegible]  0.30         [illegible]  [illegible]
1989           0.97   0.21         0.30         0.33         [illegible]
Overall mean   0.43   0.30         0.41         0.39         0.32

Notes: - The number of rates per year is generally 96, except for FA (89)
         and EF (94).
       - The “electoral file” source did not provide rates for 1986 or 1987.
       - The “housing tax” source began to be distorted in 1987.
       - The values of the differences correspond to rates expressed as a %.


8. SUPPLEMENTS 


8.1 Sub-Departmental Levels 


The use of some sources may become risky at a geogra- 
phic level below the departmental level. There are various 
reasons for this: because the hypotheses on which the 
method is based become fragile, because the numbers are 
small, etc. This is especially the case with educational 
statistics. 

However, it should be possible to operate the system for 
employment areas, or more specifically for cross-tabu- 
lations of department and employment area (there are 
approximately 420 such areas), which serve to ensure 
consistency with the departmental level. This should not 
involve too many risks, for the following reasons: 

- a certain deterioration of performance in relation to the
departmental estimates is acceptable, especially since
the departmental estimates should be of good quality;

- the data from the income tax files should be quite useful;

- trend estimation and calibration on estimates at higher
geographic levels (in this case the departmental
estimates) both act as safeguards.

Of course, there is nothing prohibiting the use of the 
system to produce estimates for other sub-departmental 
geographic units. 

At the departmental level, it does not seem useful to
adapt the parameters (initial weights and norms) to
population size; on the other hand, for sub-departmental
levels, such an adaptation appears essential. Otherwise we
run the risk of being much too strict for small areas. It
would seem that a norm function of the following type
might be appropriate:

\[
NO_S = \alpha P^{\beta},
\]

where NO_S is the norm for source S, P is the population of
the area, and α and β are two parameters that presumably
depend on source S. The parameter β is obviously negative.
If β equals -0.25, the norm doubles when the population is
divided by 16. It also appears that the type of geographic
area has an effect: the unexplained portion (le flou) would
on average be greater for a commune of 50,000 inhabitants
than for an employment area of the same size. The
parameters α and β must be defined for each
sub-departmental source and, where applicable, for each
type of area.
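
The doubling property of the power-law norm can be checked numerically. In this small sketch the function name and the value 0.2 chosen for the scale parameter are illustrative assumptions; only the exponent -0.25 comes from the text.

```python
def norm_for_population(alpha, beta, population):
    """Power-law norm: alpha * population ** beta (beta is expected to be negative)."""
    return alpha * population ** beta

# Ratio of the norm for an area 16 times smaller to the norm for the larger area.
large = norm_for_population(0.2, -0.25, 100_000)
small = norm_for_population(0.2, -0.25, 100_000 / 16)
ratio = small / large   # 16 ** 0.25, i.e. the norm doubles
```

The ratio does not depend on the scale parameter, so the doubling holds for any source once the exponent is fixed at -0.25.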


8.2 Timetable 


The greater the number of sources, the better the system 
functions. However, for a given year, data from the 
different sources become available at different times. Since 
the system is able to function with a variable number of 
sources, one can develop, at least at the departmental level, 
several sets of estimates for January 1 of year n: for 
example, interim estimates in the third quarter of year n, 
based on the first sources available, then semi-definitive
estimates in the third quarter of year n + 1, based on more
sources, and then final estimates in the third quarter of year
n + 2. Different factors must be taken into account: the
complexity of an operation, and the magnitude of the 
changes due to the addition of a source. It will be possible 
to assess the latter factor by simulations on the first years of 
implementation of the system. 


8.3 Integration of an Additional Source


The system is flexible and modular. Therefore, integra- 
ting a new source into it does not pose any particular 
problem. It is merely a matter of determining the method to 
be used in order to obtain a good estimate of the net 
migration rate for each area. The range of methods 
envisaged by the team is large enough that in most cases, it 
should be possible to find a type of method that is 
appropriate to the source. 

To determine the parameters (initial weight and norm) to 
be assigned to the new source in the synthesis, we suggest 
putting the system through a dry run, with parameters set 
arbitrarily but reasonably; it is obviously wise to start with 
a fairly high norm and a fairly low weight. By analysing the 
differences obtained between the net migration rates 
obtained from the new source and the synthetic rates, a 
better norm can be determined. The weight can then be 
adapted accordingly, using (for lack of anything better) an 
assumed relationship of quasi-proportionality between the 
weight and the inverse of the square of the norm. 
Obviously, this process can be iterated, with the parameters
of the other sources also being changed as required.
However, the tests conducted at the departmental level on 
the period 1982-1990 appear to show that the overall 
performance of the system is not highly sensitive to changes 
— even sizable ones — in the initial weights; it is therefore 
not necessary to determine these weights with great 
precision — nor, indeed, is it possible to do so — before the 
next census. 
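
The assumed quasi-proportionality between the weight and the inverse of the square of the norm can be turned into a simple rescaling rule. The sketch below is hypothetical: it anchors the new weight on a reference source whose weight and norm are already trusted, which is one possible reading of the rule and not a procedure stated in the text.

```python
def weight_from_norm(norm, ref_weight, ref_norm):
    """Weight proportional to 1 / norm**2, scaled so that the reference source keeps its weight."""
    return ref_weight * (ref_norm / norm) ** 2

# Illustrative values only: a new source whose norm is twice that of a
# reference source (weight 100, norm 0.2) gets a quarter of its weight.
new_weight = weight_from_norm(0.4, ref_weight=100.0, ref_norm=0.2)
```

Since the text reports that performance is not highly sensitive to the initial weights, a rough value obtained this way may be sufficient as a starting point.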


9. CONCLUSION 


The “multi-source” population estimation system presen- 
ted here is robust and flexible, without being overly 
complex. It can function with a variable number of sources. 
To integrate a new source into it, no long historical 
observation period is required. Aberrant data are detected 
automatically and corrected, so that they do not distort the 
estimates. The experiments carried out, while still not 
numerous, indicate that this system is effective. After a 
debugging and break-in period, it should be possible to use 
the system in production without too many risks pending 
the results of the next population census, planned for 1999. 
An Adaptive Procedure for the Robust Estimation of the
Rate of Change of Investment


PHILIPPE RAVALET¹


ABSTRACT 


The presence of outliers in survey data is a recurring problem in applied statistics, and the INSEE survey on industrial 
investment is not immune from this. The forecasting of the rate of growth of capital investment expenditures in industry 
therefore comes down to robust estimation of a total in a finite population. The first part of this article analyses the estimator 
currently used in the Investment Survey. We show that it follows a strategy of reweighting the linear estimator. But the strict 
dichotomy imposed between outliers — all assumed to be nonrepresentative — and other points is not fully satisfactory from 
either a theoretical or a practical standpoint. These flaws can be overcome by adopting a model-based approach and 
estimating by GM-estimators, applied to the case of a finite population. We then construct a robust adaptive procedure that 
determines the appropriate estimator on the basis of the residuals observed in the sample in cases where the residuals may 
be assumed to be symmetrical. Lastly, this method is applied to the data from the Investment Survey for the period 1990-1995.


KEY WORDS: Economic surveys; Outliers; Robust estimation; GM estimator; Adaptive procedure. 


1. INTRODUCTION 


Since 1952, the Institut National de la Statistique et des 
Etudes Economiques (INSEE) has been conducting an 
investment survey that provides estimates of the future 
trend of capital investment expenditures in industry, well 
before the National Accounts are released or the findings of 
exhaustive surveys are published. The estimation of the rate 
of investment growth is based on the declarations of some 
2,500 company heads concerning their intentions to 
purchase capital goods. 

The almost systematic presence of outliers in these data 
is a major problem. Outliers can seriously distort the 
estimate of the average growth rate and lead to unac- 
ceptable results. According to Chambers (1986), two types 
of outliers may be distinguished. Nonrepresentative points 
designate either measurement errors, which survey staff 
strive to correct during data collection, or unique
individuals in the population. By contrast, representative 
outliers designate individuals which, while somewhat 
unusual, cannot be considered exceptional. There are 
undoubtedly similar individuals in the population not 
questioned, and the information that they contain must be 
integrated into the estimate. 

The problem posed here is that of robust estimation of a 
total in a finite population with auxiliary information, a 
problem to which theory provides no definitive answer. 
Nevertheless, various techniques, reviewed in Lee (1995), 
can be applied. The estimation method currently used in the 
Investment Survey follows the logic of reweighting the 
linear estimator, following Hidiroglou and Srinath (1981). 
However, the identification and treatment of outliers are not 
entirely satisfactory. In particular, all outliers are assumed 
to be nonrepresentative, and the dichotomy between 


“normal” points and outliers makes the estimation quite 
sensitive to the choice of outliers. 

The introduction of a linear superpopulation model, 
which describes the change in investment at the level of 
individuals, enables us to better assess the unusual nature of 
an observation and determine how representative it is. Its 
estimation by means of GM-estimators is then an attractive 
alternative to the least squares method, whose absence of 
bias is quite costly in terms of variance. The adjustment of 
the weight function depends at the outset on characteristics 
of the population according to criteria now well described 
in the literature. Since these characteristics can change not 
only from one stratum to another but also over time, the 
significance of an adaptive procedure is obvious. On the 
basis of a first robust estimate, we determine the appearance 
of the distribution of residuals, and then we choose the 
estimator to be used according to a predefined rule. 
Following Hogg, Bril, Han and Yuh (1988), we construct an
adaptive procedure based on indicators of tail weight and 
concentration estimated from the sample, since the residuals 
are not expected to be asymmetrical. This procedure is 
applied to the data from the Investment Survey for the 
period 1990-1995. 


2. ESTIMATOR FOR THE INVESTMENT 
SURVEY 


2.1 Estimation Principle 


In a finite population U = {1, ..., N}, which here
represents a stratum of the survey, a sample s = {1, ..., n}
of size n is drawn, and s̄ = {n + 1, ..., N} designates the
population not questioned. Each company is questioned on

¹ Philippe Ravalet, Division des enquêtes de conjoncture, INSEE, 15 Bd. G. Péri, BP 100, 92244 MALAKOFF CEDEX.




its investment expenditures for two consecutive years t - 1
and t, denoted respectively x and y.

Knowing the total amount X of investments for year
t - 1 in the population, we can deduce from the estimate Ŷ
of total investments for year t the average rate of change of
equipment expenditures between t - 1 and t:

\[
\hat{\delta} = \frac{\hat{Y} - X}{X} .
\]

To simplify the notations, we define the parameter
θ = 1 + δ = Y/X, estimated by θ̂ = Ŷ/X.

The estimator currently used in the INSEE survey draws
on the ratio method, with the level of investment in t - 1 as
auxiliary information:

\[
\hat{Y}_{ratio} = X \, \frac{\sum_{i \in s} y_i}{\sum_{i \in s} x_i} .
\]

This estimator may be written as a weighted linear
estimator:

\[
\hat{Y}_{ratio} = X \sum_{i \in s} w_i z_i . \tag{1}
\]

In this expression, w_i = x_i / Σ_s x_i is the weight of
individual i and z_i = y_i / x_i is the annual change in its
investment. Such an estimator will be sensitive to the
presence of outliers on both z and w. An atypical point will
exhibit a change z that is very different from that of the
others, while an influential point will have a weight w that
is large enough to attract, by leverage, the average rate of
change of the stratum towards its own rate of change. Since
the decisive criterion for characterizing an observation as an
outlier is that the product wz is large enough to distort the
estimate Ŷ_ratio, the distinction between atypical points and
influential points is, of course, arbitrary. The generic term
large investors (or LI for short) will designate these outliers
as a group, while the term extrapolatables will refer to the
other individuals in the sample.
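
The equivalence between the ratio form and the weighted linear form, and the role of the product wz, can be illustrated with a few made-up figures. The function names and data below are assumptions for illustration; the sketch also assumes every sampled firm has x > 0, since z is undefined otherwise.

```python
import numpy as np

def ratio_theta(x, y):
    """Ratio form: sum of y over sum of x in the sample."""
    return y.sum() / x.sum()

def weighted_theta(x, y):
    """Weighted linear form: sum of w_i * z_i, with w_i = x_i / sum(x) and z_i = y_i / x_i."""
    w = x / x.sum()
    z = y / x
    return np.sum(w * z)

x = np.array([10.0, 20.0, 30.0, 40.0])   # investment in t - 1 (illustrative)
y = np.array([12.0, 18.0, 33.0, 80.0])   # investment in t (illustrative)
contributions = (x / x.sum()) * (y / x)  # products w_i * z_i, one per firm
```

In this toy sample the last firm combines the largest weight with the largest change, so its product wz dominates the sum; this is the situation in which a single observation pulls the stratum estimate towards its own rate of change.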

Having carried out an a posteriori partition of the sample
s = {LI} ∪ {extrapolatables}, we estimate the total
investments of the rest of the population s̄ on the basis of the
behaviour of only the extrapolatable individuals according
to the ratio method:

\[
\hat{Y}_{LI} = \sum_{i \in s} y_i + \left( \sum_{i \in \bar{s}} x_i \right) \frac{\sum_{i \in extra} y_i}{\sum_{i \in extra} x_i} . \tag{2}
\]

In (2), the weight of the extrapolatables,
1 + Σ_{s̄} x_i / Σ_{extra} x_i, is strictly greater than the weight
of the large investors, which is equal to 1.


2.2 Selection of Large Investors 


The large investors are selected within each stratum on
the basis of their influence on the estimation of θ according
to an iterative procedure. At the outset, all individuals are
assumed to be extrapolatable, and for each of them we
calculate a not-taken-into-account index, measuring the
impact on θ̂ of its exclusion from the sample,
NTIA_i = (Ŷ_LI - Ŷ_LI^(i)) / X, where Ŷ_LI^(i) is the estimated
total without individual i.

The firm with the largest NTIA index in absolute value
is said to be a large investor. Ŷ_LI is then re-estimated with
this new partition of U, and then the next large investor is
identified. The selection stops when all extrapolatable
individuals have an influence on the estimate that is below
a given threshold. The greater the number and mass of
observations, the easier it is to verify this condition.
Conversely, it will prove impossible to verify the condition
if the number of individuals is too small; in that case, the
survey manager merely makes sure that no individual has a
much greater influence than the others, thus introducing an
element of subjectivity into the procedure.
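
The iterative selection can be sketched as follows. This is one possible reading, not the survey's production code: "exclusion from the sample" is interpreted here as moving the firm to the non-questioned population, the threshold value is left to the caller, the loop stops when only two extrapolatable firms remain, and all names are hypothetical.

```python
import numpy as np

def y_li(x, y, x_total, extra):
    """Sampled total plus ratio extrapolation based on the extrapolatable units only."""
    ratio = y[extra].sum() / x[extra].sum()
    return y.sum() + (x_total - x.sum()) * ratio

def ntia(x, y, x_total, extra, i):
    """Relative change in the estimate when unit i is dropped from the sample (assumed reading)."""
    keep = np.ones(len(x), dtype=bool)
    keep[i] = False
    full = y_li(x, y, x_total, extra)
    without = y_li(x[keep], y[keep], x_total, extra[keep])
    return (full - without) / x_total

def select_large_investors(x, y, x_total, threshold):
    """Flag large investors one at a time until every remaining index is below the threshold."""
    extra = np.ones(len(x), dtype=bool)
    while extra.sum() > 2:
        idx = np.flatnonzero(extra)
        scores = np.array([abs(ntia(x, y, x_total, extra, i)) for i in idx])
        if scores.max() < threshold:
            break
        extra[idx[scores.argmax()]] = False   # never reconsidered afterwards
    return ~extra
```

As in the procedure described above, a firm flagged as a large investor is never reconsidered, which is the source of the unnecessary exclusions mentioned in the text.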

By this iterative mechanism, the usual phases of 
detection and treatment of outliers are carried out 
simultaneously. The main problem is that the status of an 
individual is not an intrinsic characteristic but instead 
depends on the composition of the sample. This can change 
from one survey to another. In addition, in certain 
hypothetical cases (Ravalet 1996), this procedure can lead 
to the unnecessary exclusion of some individuals, since at 
no point is the status of large investor called into question. 


2.3 Strategy for Reweighting the Linear Estimator 


The estimator LI in fact follows from the strategy for
reweighting the linear estimator (1) presented by Hidiroglou
and Srinath (1981) using the example of estimation of a
total without auxiliary information. Having already carried
out a partition s = s_1 ∪ s_2 of the sample distinguishing the
outliers s_1 (numbering n_1) from the other observations s_2,
the authors propose to reduce, in Ŷ = (N/n) Σ_s y_i, the weight
N/n of the outliers to a lower value λ by positing

\[
\hat{Y}_{\lambda} = \lambda \sum_{i \in s_1} y_i + \frac{N - \lambda n_1}{n - n_1} \sum_{i \in s_2} y_i .
\]

The optimal value of λ that minimizes the mean square
deviation of this estimator, whether or not conditional on
the number of outliers in the sample, depends on several
parameters of the population. Without prior information,
the choice of λ is a delicate one.
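
A small numerical sketch of this reweighting may help. It assumes that the outliers receive the reduced weight and that the remaining observations share the rest of the population size N equally, so that the weights still sum to N; the function name and the data are illustrative.

```python
import numpy as np

def reweighted_total(y, is_outlier, pop_size, lam):
    """Outliers get weight lam; the other units share the remaining weight so weights sum to N."""
    n = len(y)
    n1 = int(is_outlier.sum())
    w_other = (pop_size - lam * n1) / (n - n1)
    return lam * y[is_outlier].sum() + w_other * y[~is_outlier].sum()

y = np.array([1.0, 2.0, 3.0, 100.0])
is_outlier = np.array([False, False, False, True])
expansion = reweighted_total(y, is_outlier, 40, lam=10.0)   # lam = N/n: usual expansion estimator
self_repr = reweighted_total(y, is_outlier, 40, lam=1.0)    # outlier represents only itself
```

Setting the reduced weight equal to N/n leaves the ordinary expansion estimator unchanged, while a value of 1 treats the outlier as unique in the population; the two results differ widely in this toy example, which illustrates why the choice is delicate.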

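Applied to the case of the estimator of the ratio with
auxiliary variable x, this is written as:

\[
\hat{Y}_{ratio}^{\lambda}
= \lambda \sum_{i \in s_1} y_i
+ \frac{X - \lambda \sum_{i \in s_1} x_i}{\sum_{i \in s_2} x_i} \sum_{i \in s_2} y_i
\]
\[
= \sum_{i \in s} y_i
+ \left( \sum_{i \in \bar{s}} x_i \right) \frac{\sum_{i \in s_2} y_i}{\sum_{i \in s_2} x_i}
+ (\lambda - 1) \left( \sum_{i \in s_1} x_i \right)
\left( \frac{\sum_{i \in s_1} y_i}{\sum_{i \in s_1} x_i} - \frac{\sum_{i \in s_2} y_i}{\sum_{i \in s_2} x_i} \right) . \tag{3}
\]

The first two terms of the second member of (3) form an
estimate of the total Y, under the implicit hypothesis that all
outliers are in the sample, and the third is a correction
taking account of the possible presence of outliers in the
population not questioned. This correction is a function of
the λ selected and the difference in average behaviour
between the two types of individuals estimated in the
sample.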

When (2) and (3) are considered together, it may be seen
that the estimator LI is formally equivalent to the case
λ = 1. The use of Ŷ_LI thus implicitly assumes that the
outliers have been correctly identified and are all
non-representative. In Ravalet (1996), it was shown that these
two hypotheses were unfortunately seldom verified in the
context of the Investment Survey.

Since the identification procedure is manual and the 
criterion used is relatively ad hoc in the absence of any 
hypothesis on the population, it is not impossible that some 
outliers will escape selection. The use of the ratio on the 
extrapolatables then poses the problem of the robustness of 
the estimation in relation to the choice of large investors. In 
addition, it is unlikely that all these points are unique. The
atypical points, which are especially numerous among small
and medium-sized firms, should instead be considered as
representative. However, choosing λ > 1 would inevitably
raise the question of the robustness of the third term of (3).

To try to compensate for these defects, changes to the
estimator Ŷ_LI are possible. For example, the mean of the
extrapolatables may be replaced by a more robust estimator,
and only the nonrepresentative points are designated as 
large investors. This technique fits into the more general 
framework of M-estimators, in which the existence of a 
model facilitates both the detection and treatment of outliers 
(Lee 1995). It is then no longer a matter of constructing a 
strict dichotomy between outliers and other points but 
rather of defining areas of varying representativeness. 


3. ROBUST ESTIMATION BY
GM-ESTIMATORS


3.1 The Linear Model and GM-Estimators


Assume the existence of a linear model ξ that links
together, for the overall population U, investments x and y
on dates t - 1 and t:

\[
\xi : \quad y_i = \beta x_i + \varepsilon_i ,
\qquad \text{with} \quad
E(\varepsilon_i) = 0, \quad
E(\varepsilon_i \varepsilon_j) = 0 \;\; \forall \, i \neq j, \quad
V(\varepsilon_i) = \sigma^2 \eta(x_i) .
\]

Slope β of the regression line passing through the origin
in the superpopulation model is interpreted as the rate of
change θ in the population. The variance of y is assumed to
be an increasing function of x, and η is generally a power
function: η(x_i) = x_i^γ.

According to the model, the best linear unbiased
estimator (Brewer 1963 and Royall 1970) of the total is

\[
\hat{Y}_{LS} = \sum_{i \in s} y_i + \hat{\beta}_{LS} \sum_{i \in \bar{s}} x_i ,
\qquad
\hat{\beta}_{LS} = \left( \sum_{i \in s} \frac{x_i y_i}{\eta(x_i)} \right) \left( \sum_{i \in s} \frac{x_i^2}{\eta(x_i)} \right)^{-1} ,
\]

where β̂_LS is the least squares estimator.

In the particular case η(x) = x, this expression reduces
to β̂_LS = Σ_s y_i / Σ_s x_i, the estimator of the ratio. This
unbiased estimator is efficient only under the hypothesis of
normality of the residuals, and it does not prove to be very
robust.
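
The reduction to the ratio estimator can be verified numerically. The sketch below assumes a power variance function with an exponent called `gamma` here (a name introduced for the illustration) and uses made-up data.

```python
import numpy as np

def ls_slope(x, y, gamma=1.0):
    """Weighted least squares slope through the origin with variance function x ** gamma."""
    eta = x ** gamma
    return np.sum(x * y / eta) / np.sum(x ** 2 / eta)

x = np.array([10.0, 20.0, 30.0, 40.0])
y = np.array([12.0, 18.0, 33.0, 80.0])
slope_ratio = ls_slope(x, y, gamma=1.0)        # equals sum(y) / sum(x)
slope_mean_of_ratios = ls_slope(x, y, gamma=2.0)  # equals mean(y / x)
slope_ols = ls_slope(x, y, gamma=0.0)          # ordinary least squares through the origin
```

The three exponents give three familiar estimators, which shows how strongly the assumed variance function determines the weight given to large firms.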

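The M-estimators (Huber 1981) serve to define a robust
version of the least squares by replacing the square
function, in the minimization program, with a function ρ
that increases less rapidly:

\[
\min_{\beta} \sum_{i \in s} \rho\!\left( \frac{y_i - \beta x_i}{\sigma \sqrt{\eta(x_i)}} \right) .
\]

The M-estimator β̂_M is the solution of the following implicit
equation:

\[
\sum_{i \in s} \psi\!\left( \frac{y_i - \hat{\beta}_M x_i}{\sigma \sqrt{\eta(x_i)}} \right) \frac{x_i}{\sqrt{\eta(x_i)}} = 0 ,
\qquad \text{where} \quad
\psi(t) = \frac{\partial \rho(t)}{\partial t} .
\]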

The function ψ, like Huber’s function ψ(t) =
Max(-c, Min(t, c)), depends on one or more adjustment
constants c controlling the portion of observations that must
be considered as outliers. This estimator will still be
sensitive to the effect of outliers on the explanatory variable
x. Therefore a more general class of estimators, called
GM-estimators (Hampel, Ronchetti, Rousseeuw and Stahel
1986), is defined by means of the following implicit
equation:

\[
\sum_{i \in s} w(t_i) \, \psi\!\left( \frac{y_i - \beta x_i}{\sigma \sqrt{\eta(x_i)} \; v(t_i)} \right) t_i = 0 ,
\qquad \text{with} \quad
t_i = \frac{x_i}{\sqrt{\eta(x_i)}} .
\]

A choice usually made is Mallows’ formulation: v(t) = 1
and w(t) = 1/t. Hence a robust estimator β̂_R will verify the
implicit equation

\[
\sum_{i \in s} \psi\!\left( \frac{y_i - \hat{\beta}_R x_i}{\sigma \sqrt{\eta(x_i)}} \right) = 0 . \tag{4}
\]

In general, the parameter σ is unknown and must be
replaced in this expression by a robust estimate σ̂ of the
dispersion of the residuals

\[
r_i = \frac{y_i - \hat{\beta}_R x_i}{\sqrt{\eta(x_i)}} .
\]

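The estimator of the total will then be:

\[
\hat{Y}_R = \sum_{i \in s} y_i + \hat{\beta}_R \sum_{i \in \bar{s}} x_i . \tag{5}
\]

This estimator is studied by Gwet and Rivest (1992). In
general, it is not unbiased in relation to the sample design.
Chambers (1986) proposes to correct that bias by
introducing into (5) a third term that estimates it robustly:

\[
\hat{Y}_C = \sum_{i \in s} y_i + \hat{\beta}_R \sum_{i \in \bar{s}} x_i
+ \left( \sum_{i \in \bar{s}} x_i \right)
\frac{\displaystyle \sum_{i \in s} \frac{x_i}{\hat{\sigma} \sqrt{\eta(x_i)}} \,
\psi_C\!\left( \frac{y_i - \hat{\beta}_R x_i}{\hat{\sigma} \sqrt{\eta(x_i)}} \right)}
{\displaystyle \sum_{i \in s} \frac{x_i^2}{\hat{\sigma}^2 \eta(x_i)}} .
\]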

Choosing a bounded function ψ_C seems a good
compromise between estimator bias and variance. For
example, Welsh and Ronchetti (1994) opt for a Huber
function with a large adjustment constant c = 15. But the
adjustment of ψ_C, without prior information on the density
of the outliers, is always difficult.


3.2 Choice of Estimator 


The desirable properties of ψ functions are now well
known with reference to the problem of estimating a central
tendency. They must be bounded, continuous, and
equivalent to an identity in the vicinity of zero. Strictly
monotone functions (Huber) are distinguished from
redescending functions such as Tukey’s biquadratic
function, Andrews’ sine and the Hampel or Cauchy
function. Because their influence function tends toward zero,
these estimators will be less sensitive to the presence of
outliers than the Huber function. The speed of convergence
toward zero is an essential characteristic of redescending
functions. Those that are nil at a finite distance (Hampel,
Tukey or Andrews) exclude outliers from the estimation of
β, whereas the others assign them low representativeness.

The choice and adjustment of the ψ function are difficult.
They greatly depend on the nature of the data and more
specifically on the distribution of the residuals (Hoaglin,
Mosteller and Tukey 1983, Ch. 11). An idea, however
approximate, of the appearance of the distribution of the
residuals should make it possible to better target both the
choice and the adjustment of the estimator, and hence to
make the estimation more efficient. This intuitive remark is
at the origin of adaptive procedures, presented in particular
by Hogg (1974) and (1982). The idea is to evaluate the
nature of the distribution of the residuals, calculated on the
basis of an initial robust estimate (of the L_1 norm type, for
example), using carefully selected robust indicators (tail
weight, asymmetry, concentration, etc.). The existence of
these indicators makes it possible, using a predefined
decision rule, to select the appropriate estimator for this
situation, and the implicit equation (4) is solved by taking
the first robust estimate of β as an initial value.

The idea of an adaptive procedure appears all the more 
attractive since it systematizes the study that must precede 
the choice and adjustment of an estimator. That study can 
prove extremely costly if it must be performed manually for 
each stratum of the sample and repeated for each survey. 


4. CONSTRUCTION OF AN ADAPTIVE 
PROCEDURE 


This section describes the construction of an adaptive 
procedure for calculating the average rate of change of 
investment on the basis of economic survey data. 
Consequently, certain choices were made in light of the 
specific nature and characteristics of those data and are not 
necessarily transposable to other regression models. In 
particular, after checking the data, we adopted the 
hypothesis of a symmetrical distribution of residuals and we 
excluded the case of light-tail distributions. 

The construction of an adaptive procedure, which draws 
on the works of Moberg, Ramberg and Randles (1980), is 
carried out in several stages. The first step is to choose the 
$\psi$ function (or family of functions) to be used. The second 
is to select the various criteria for characterizing the 
distribution of residuals. Using these criteria, a 
classification rule is constructed. Finally, each class is 
matched with the adjustment of the estimator to be used. 


4.1 Choice of $\psi$ Function 


Since Huber-type monotone functions do not provide 
sufficient protection against outliers, only redescending 
functions were considered. Among them, we selected the 
generalized Cauchy function (used in particular by Moberg 
et al. 1980 to approximate generalized lambda functions) 
and the Tukey biquadratic function: 
These two estimators are quite different in their 
treatment of outliers (see Figure 1). The biquadratic 
function equals zero for longer than the Cauchy function, 
but on the other hand it has a finite rejection point: the 
residuals beyond $c\sigma$ do not enter into the estimate, 
whereas the Cauchy function assigns them a certain 
representativeness. The parameter $b$ serves, in principle, to 
control the asymmetry of $\psi$ according to that of the 
residuals. 


Figure 1. Cauchy and Tukey Functions 


4.2 Parameter of Scale, Calculation Algorithm and 
Selection Criteria 


In general an estimator $\hat{\sigma}$ of dispersion is defined by an 
implicit equation $\sum_i \chi(r_i/\hat{\sigma}) = 0$, where $\chi$ is an even function. 
It is therefore a matter of solving the following system of non-linear 
equations in $(\beta, \hat{\sigma})$: 


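[Equation (6), the system of estimating equations in $(\beta, \hat{\sigma})$ built on the residuals $y_i - \beta x_i$, is not legible in the source.]    (6) 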
Rivest (1989) offers several examples showing that 
solving system (6) can pose problems, owing to the fact 
that there may be several solutions, even in the case of 
a monotone $\psi$ function. Following his recommendations, 
we proceed in two stages. First, the dispersion parameter 
$\sigma$ is estimated using the median of the absolute 
values (MAD) of the residuals defined on the basis of the 
median of the individual rates of change. Then $\beta$ is 
calculated by (4) using the value of $\sigma$ found previously. 

For solving (4), we preferred the reweighting algorithm 
to the Newton-Raphson algorithm, since it seems to 
converge more easily, especially when the adjustment 
constant is small. 
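
The two-stage scheme described above can be sketched as follows. This is only an illustration under simplifying assumptions: the model is taken to be y = beta * x + error with no variance function, the estimating equation is assumed to be sum_i psi(r_i / sigma) * x_i = 0 (equation (4) itself is not reproduced here), and the names `tukey_weight` and `robust_slope` are hypothetical.

```python
import numpy as np

def tukey_weight(u, c):
    """Biweight weight w(u) = psi(u) / u = (1 - (u/c)^2)^2 for |u| <= c, else 0."""
    w = (1.0 - (u / c) ** 2) ** 2
    return np.where(np.abs(u) <= c, w, 0.0)

def robust_slope(x, y, c=4.5, tol=1e-8, max_iter=100):
    """Illustrative two-stage robust estimate of beta in y = beta * x + error."""
    # Stage 1: initial robust estimate (median of the individual rates) and
    # a scale taken as the median absolute residual, then held fixed.
    beta = np.median(y / x)
    sigma = np.median(np.abs(y - beta * x))
    # Stage 2: iterative reweighting; each step is a weighted least-squares
    # slope with weights recomputed from the current residuals.
    for _ in range(max_iter):
        w = tukey_weight((y - beta * x) / sigma, c)
        new_beta = np.sum(w * x * y) / np.sum(w * x * x)
        if abs(new_beta - beta) < tol:
            return new_beta
        beta = new_beta
    return beta
```

Each reweighting step only requires a weighted sum, with no derivative of the weight function, which is one plausible reason the iteration behaves well when the adjustment constant is small.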

Since the effectiveness of an adaptive procedure depends 
on the effectiveness of the decision-making process, the 
greatest attention must be paid to the nature, quality and 
robustness of the information that guides the choice of the 
estimator. Tail weight is an indispensable indicator, since it 
provides information on the relative significance of outliers 
in the sample and thus in the population (see Hoaglin et al. 
1983, ch. 10). For the tail weight indicator, we adopted the 
proposal of Hogg (1974): 


$$t(p) = \frac{U(p) - L(p)}{U(0.5) - L(0.5)}$$ 


where U(p) (resp. L(p)) is the mean of the np largest (resp. 
smallest) order statistics, using linear interpolation when np 
is not an integer. We chose p = 0.05; for the normal distribution 
t(.05) is equal to 2.59. 
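
A minimal sketch of this tail weight indicator is given below. The handling of a non-integer np (giving the next order statistic a fractional weight) is an assumption about how the interpolation is done, and the function names `tail_mean` and `tail_weight` are hypothetical.

```python
import numpy as np

def tail_mean(sorted_vals, m):
    """Mean of the first m values of sorted_vals; a non-integer m gives the
    next value a fractional weight (assumed interpolation rule)."""
    k = int(np.floor(m))
    total = sorted_vals[:k].sum()
    if m > k:
        total += (m - k) * sorted_vals[k]
    return total / m

def tail_weight(resid, p=0.05):
    """Ratio (U(p) - L(p)) / (U(0.5) - L(0.5)) computed on the residuals."""
    r = np.sort(np.asarray(resid, dtype=float))
    n = r.size

    def upper(q):  # mean of the n*q largest order statistics
        return tail_mean(r[::-1], n * q)

    def lower(q):  # mean of the n*q smallest order statistics
        return tail_mean(r, n * q)

    return (upper(p) - lower(p)) / (upper(0.5) - lower(0.5))
```

On a large simulated normal sample the indicator should come out close to the value 2.59 quoted above, and it equals 1 by construction when p = 0.5.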

In addition, like Hogg et al. (1988), we considered it 
important to test for the possible presence of a distribution 
of the double exponential type, measuring the concentration 
of residuals by the following pk indicator: 


[The defining formula for pk is not legible in the source.] 

where $\bar{X}(a, b)$ is the mean of the order statistics between 
the na-th and the nb-th, with interpolation if na or 
nb is not an integer. We selected $\alpha = 0.05$ and $\beta = 0.15$, which gives 
pk = 2.7 for a normal distribution. 

Finally, various studies (Moberg et al. 1980, Hogg 
et al. 1988) have emphasized the importance of the asymmetry 
of distributions. When the residuals are asymmetrical, 
the bias of robust estimators can be sizable, 
making them tricky to use (Chambers and Kokic 1993). In 
the INSEE Investment Survey, the residuals are theoretically 
asymmetrical since they are bounded below 
($r = y - \beta x > -\beta x$). However, we noted empirically 
that this asymmetry was very slight and could safely be 
ignored. The failure of the correction of a possible bias by 
the function $\psi$ in Chambers' estimator moreover confirms 
this observation. Only the symmetrical case is considered 
here; the bias of the estimators defined by (5) is therefore 
zero. 
4.3 Classification of Distributions and Adjustment 
of the Estimator 


The definition of the decision rule was based on the 
study of eight specific symmetrical distributions illustrating 
various tail weight and concentration situations (see 
Table 1). We were interested in the family of contaminated 
distributions $CN(\alpha, K)$, with the distribution function 
$F(x) = (1 - \alpha)\Phi(x) + \alpha\Phi(x/K)$, where $\Phi$ is the cumulative 
distribution function of $N(0, 1)$, since these distributions 
give a good representation of real data (Hoaglin 
et al. 1983, ch. 10), especially the data in the Investment 
Survey (Ravalet 1996). While Gaussian in the middle, they 
nevertheless contain more outliers than the normal 
distribution $N(0, 1)$. 


Table 1 
Eight Specific Distributions 

    Distribution                        t(.05)        pk 
1   Normal distribution                 [illegible]   2.76 
2   Contaminated dist. CN(.05, 3)       2.94          2.83 
3   Double exponential dist.            3.28          3.41 
4   Contaminated dist. CN(.05, 10)      4.47          2.85 
5   Contaminated dist. CN(.10, 10)      5.42          3.05 
6   Contaminated dist. CN(.20, 10)      5.64          4.44 
7   Slash distribution                  7.65          4.19 
8   Cauchy distribution                 7.82          4.78 


The two indicators t(.05) and pk were simulated over 
these eight distributions, for several sample sizes. The 
graph of (t(.05), pk) serves to distinguish four groups of 
distributions: light-tailed, relatively unconcentrated 
distributions of the normal type or CN(.05,3); heavy-tailed 
distributions of the type CN(.05,10), CN(.10,10) and 
CN(.20,10); very heavy-tailed distributions of the Slash 
or Cauchy type; and concentrated distributions such as the 
double exponential distribution. These four classes are 
defined (see Figure 2) by the following boundary 
equations: 


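Class I:    t(.05) < 3.6 - [?]/n  and  pk < 3.20 

Class II:   3.6 - [?]/n < t(.05) < 5.8 - [?]/n 

Class III:  5.8 - [?]/n < t(.05) 

Class IV:   t(.05) < 3.6 - [?]/n  and  pk > 3.20 

(The numerators of the 1/n correction terms, marked [?], are not legible in the source.) 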

Figure 2. Four Classes of Distributions 


The final stage consists in setting the adjustment of the 
two estimators in each class. Since we are interested only in 
the symmetrical case, the parameter b of the Cauchy 
function is zero. By simulation, we determined for the eight 
reference distributions the optimal constants c of the Tukey 
and Cauchy functions (i.e., those minimizing the variance of these 
estimators or, what amounts to the same thing here, their 
mean squared error). These do indeed decrease with tail 
weight, except of course in the case of the double 
exponential distribution, which requires an adjustment 
similar to those used for the Slash and Cauchy distributions. 

Tukey’s estimator is more efficient on the normal or 
contaminated distributions, but it generally requires finer 
adjustment. Figure 3 shows the example of the contaminated 
distribution CN(.10,10). Lastly, while the choice of 
the constant appears to be relatively critical for the heavy-tailed 
or concentrated distributions, a wide range of values is 
acceptable for distributions close to the normal distribution. 


Figure 3. Variance of Tukey and Cauchy Estimators for the 
Distribution CN(.10,10) (n = 100); horizontal axis: adjustment 
constants, vertical axis: estimate variance 
The synthesis of these results serves to define the 
adjustments to be used on each distribution class. These 
adjustments, established for samples of size 100 (Table 2), 
remain entirely acceptable for sample sizes between 50 
and 150. 


Table 2 
Adjustment of Estimators by Class of Distribution 
of Residuals (n = 100) 


Class   Tukey   Cauchy 
I       7       7 
II      4.5     4 
III     3       1 
IV      3       1 
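
The decision rule and Table 2 can be combined into a small lookup, sketched below under stated assumptions. The 1/n correction terms of the class boundaries are not legible in the source, so they appear here as hypothetical parameters `low_corr` and `high_corr` that default to zero; the treatment of values falling exactly on a boundary is also an assumption, and the names `classify` and `TUNING` are illustrative.

```python
# Adjustment constants from Table 2 (established for n = 100).
TUNING = {
    "I": {"tukey": 7.0, "cauchy": 7.0},
    "II": {"tukey": 4.5, "cauchy": 4.0},
    "III": {"tukey": 3.0, "cauchy": 1.0},
    "IV": {"tukey": 3.0, "cauchy": 1.0},
}

def classify(t05, pk, n, low_corr=0.0, high_corr=0.0):
    """Assign a residual distribution to Class I-IV from t(.05) and pk.

    low_corr and high_corr stand in for the illegible numerators of the
    1/n corrections; with the default 0.0 the boundaries are 3.6 and 5.8.
    """
    low = 3.6 - low_corr / n
    high = 5.8 - high_corr / n
    if t05 >= high:
        return "III"          # very heavy tails
    if t05 >= low:
        return "II"           # heavy tails
    # Lighter tails: split on the concentration indicator pk.
    return "IV" if pk > 3.20 else "I"

def adjustment(t05, pk, n, family="tukey"):
    """Return the adjustment constant c for the chosen psi family."""
    return TUNING[classify(t05, pk, n)][family]
```

With the Table 1 values, for example, the double exponential distribution (3.28, 3.41) falls in Class IV and CN(.05, 10) (4.47, 2.85) in Class II.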


5. APPLICATION TO THE INVESTMENT 
SURVEY 


5.1 The Problem of Stratification 


The strata used for the LI estimator are defined by the 
cross-tabulation of an activity (18 manufacturing sectors) 
and a company size class (small, medium or large). Among 
these 54 strata, approximately 20 never contain more than 
20 observations. This stratification is therefore too fine for 
the adaptive procedure to be used correctly, as it assumes a 
minimum number of observations. 

Since small firms are fairly distinct from medium-sized 
and large firms in terms of dispersion and residuals tail 
weight, differentiation by size is maintained. Sectors must 
thus be grouped. We decided not to adopt the method used 
by Sohre (1995), which consists of grouping after data 
collection those sectors having the closest parameters (here 
the average change in investment). Proximity is impossible 
to assess in small strata, and the groups obtained are likely 
to change from one survey to another, making comparisons 
difficult. We preferred to redefine 15 new strata based on a 
higher classification level distinguishing only four sectors: 
intermediate goods, professional capital goods, automobile, 
and consumer goods. 


5.2 Characteristics of Strata 


The hypothesis of a variance of residuals independent of 
x in the model $\xi$ cannot be accepted. The choice of $\gamma$ in the 
function $\eta$ is made in such a way that the curve of the 
residuals (in absolute value) as a function of the regressor, 
smoothed by the LOESS method, shows no trend 
(Cleveland 1979). For the stratum representing intermediate 
goods and medium-sized companies in the April 1995 
survey (see Figure 4), $\gamma = 1.3$ is an acceptable compromise 
between the appearance of a downward trend for small 
values of x and the cancellation of the upward trend for the 
larger values of x. A similar examination of the other strata 
confirmed this choice for the manufacturing industry as a 
whole. 
In each stratum, the distribution of the residuals systematically 
exhibits a heavier tail than the normal distribution, 
without being extremely heavy-tailed. Within a given 
sector, the tail weight indicator decreases with company 
size. The great majority of the strata representing small and 
medium-sized firms were assigned to Class II. Large firms 
more often exhibit somewhat heavy-tailed distributions, 
close either to the normal distribution (Class I) or to the 
double exponential distribution (Class IV). Class II is by far 
the largest and represents 75% of cases. Only 20% of the 
distributions are recognized as somewhat heavy-tailed and 
are assigned in equal proportions to Classes I and IV. On the 
other hand, very heavy-tailed distributions (Class III) are 
unusual (less than 5% of the cases). While there appears to 
be a certain persistence in the classification, it is not perfect, 
and the changes are quite real, since they survive a slight 
modification of the boundaries between classes. This 
fully justifies the use of an adaptive procedure. 


Figure 4. Absolute Value of Residuals ($\gamma = 1.3$, Intermediate Goods, 
Size 2, April 95) 


5.3 Resulting Estimates 


The estimation procedure based on (5), applied to the six 
surveys covering the period 1990-1995, yielded the results 
shown in Figure 5. Also shown are National Accounts 
estimates, those obtained with the LI estimator, and those 
from the Annual Business Survey (ABS), which is exhaustive. 

For the manufacturing sector as a whole, the results of 
the adaptive procedure are comparable to those obtained 
with the LI estimator. The biquadratic function results in 
estimates that are consistently lower than those obtained 
with the Cauchy function. With a finite rejection point, the 
Tukey function is less influenced by the slight asymmetry 
toward the right in the distribution of the residuals. These 
new estimates are closer to those of the ABS than to the 
National Accounts estimates. This is hardly surprising, 
considering the excellent correlation between individual 
ABS data and the responses obtained in the survey. As yet 
there is no explanation for the differences in 1991 and 1994 
in relation to the National Accounts estimates. Apart from 
the year 1994, the estimates obtained with the Cauchy 
function are entirely acceptable in the intermediate goods 
and automobile sectors and to a lesser extent in the 
professional capital goods sector. On the other hand, in 
consumer goods, the results are fairly far from the National 
Accounts estimates. Here we are likely running up against 
a problem of sample quality. This sector is quite hetero- 
geneous, and a few activities such as printing are poorly 
covered by the survey. 


Figure 5. Investment Growth Rate in Value in the Manufacturing 
Industry, 1990 to 1995 (series: National Accounts, Cauchy, Tukey, LI, ABS) 


6. CONCLUSIONS 


This article presents a theoretical justification of a 
procedure currently used to process data from the 
Investment Survey; in particular it offers a justification of 
the principle of excluding outliers or large investors. 
However, the strategy of reweighting the linear estimator 
following Hidiroglou and Srinath (1981) shows itself to be 
insufficient for this purpose in several respects, mainly 
having to do with the identification and treatment of 
representative outliers. The dichotomy between extra- 
polatable individuals and large investors appears too radical 
and leads to a lack of robustness, since the influence curve 
of this estimator is not continuous. 

On the other hand, the hypothesis of a linear super- 
population model and its estimation by GM-estimators 
seemed to us to be of great interest from both a method- 
ological and practical standpoint. The insertion of these 
techniques into an adaptive procedure also makes it 
possible to have a robust estimator for a variety of situa- 
tions. Following principles described in the literature, the 
procedure proposed here uses indicators of tail weight and 
concentration of the residuals in the linear model calculated 
from the sample, to decide on the adjustment of the weight 
function to be used, it being assumed that the residuals are 
symmetrical. The estimates made with the Cauchy function 
yielded satisfactory results on the manufacturing industry, 
and they largely validate previously published results. The 
advantages of this method over the one currently used 
basically have to do with lower implementation costs and 
greater control over the methodology employed. 

The adaptive procedure was constructed independently of 
the survey, and therefore there is no guarantee that the 
classification is optimal for the strata content. Furthermore, 
we did not study the robustness of the rule for assigning values 
to a class. This issue is important when one carries out several 
successive measurements and wants to interpret the 
revisions. Clearly, further research on these classification 
methods is required, in order to integrate additional 
information such as the information yielded by earlier 
estimates or comprehensive surveys of the population studied. 
Sampling and Maintenance of a Stratified Panel 
of Fixed Size 


F. COTTON and C. HESSE¹ 


ABSTRACT 


Statistical agencies often constitute their business panels by Poisson sampling, or by stratified sampling of fixed size and 
uniform probabilities in each stratum. This sampling corresponds to algorithms which use permanent numbers following 
a uniform distribution. Since the characteristics of the units change over time, it is necessary to periodically conduct 
resamplings while endeavouring to conserve the maximum number of units. The solution by Poisson sampling is the 
simplest and provides the maximum theoretical coverage, but with the disadvantage of a random sample size. On the other 
hand, in the case of stratified sampling of fixed size, the changes in strata cause difficulties precisely because of these fixed 
size constraints. An initial difficulty is that the finer the stratification, the more the coverage is decreased. Indeed, this is 
likely to occur if births constitute separate strata. We show how this effect can be corrected by rendering the numbers 
equidistant before resampling. The disadvantage, a fairly minor one, is that in each stratum the sampling is no longer a 
simple random sampling, which makes the estimation of the variance less rigorous. Another difficulty is reconciling the 
resampling with a possible rotation of the units in the sample. We present a type of algorithm which continues, after 
resampling, the rotation begun before resampling. It is based on transformations of the random numbers used for the sampling, 
so as to return to resampling without rotation. These transformations are particularly simple when they involve equidistant 
numbers, but can also be carried out with the numbers following a uniform distribution. 


KEY WORDS: Panel; Stratified sampling of fixed size; Stratified simple random sampling; Maximum coverage; 
Sample rotation; Equidistant numbers. 


1. INTRODUCTION 


We consider the successive selection of samples 
intended to follow the change over time of sums of 
variables, more generally functions of sums, in a 
population. For example, this may be a population of 
businesses or establishments for which we wish to follow 
monthly sales trends. The ideal would be to be able to 
conserve a constant sample, but demographic movements 
make this impossible and it may not be desirable in light of 
the survey response burden. 

The methods for selecting units presented in this article 
are subject to the following three constraints: 

Firstly, it is necessary to regularly introduce births and to 
take deaths into account. 

Secondly, sampling involves characteristics of units 
which change over time, such as the size or primary activity 
of businesses. These characteristics can be used to modulate 
the probabilities of inclusion. Notably, it is often prudent 
to increase these probabilities with the size of the units if 
we estimate sums of variables correlated with this size. In 
addition, these characteristics may also be used as 
stratification criteria. In this article, a stratum will mean a 
subset of the population within which the sampling is of 
fixed size, up to rounding. However, the 
criteria used in the stratification of the first sampling, such 
as the primary activity of the unit, become “inexact” or 
become less and less correlated with the variables of 
interest such as size. This results in a progressive increase 
in the variance of the estimates. To remedy this, it is 
appropriate to resample from 
time to time after updating the stratification and calculating 
new probabilities of inclusion. This must be done while 
endeavouring to conserve the maximum number of units. 
Inevitably, however, units will be excluded and others will be 
introduced, mainly because of changes in the probabilities 
of inclusion, although this would also happen because of 
the changes of strata, even if the probabilities of inclusion 
remained constant. 

Thirdly, we would like to distribute our survey response 
burden over a larger number of units. We determined a 
maximum duration limit for inclusion in the panel. Beyond 
this limit, the unit is replaced by another unit chosen from 
those which have never been included, or which have been 
absent the longest. We call this change of the sample over 
time rotation. It is generally slow and regular. The various 
methods for performing this rotation are well known in 
statistical agencies. They consist mainly in attributing, at 
the beginning, a permanent random number to each unit of 
the population. The successive samples are defined by 
intervals over these numbers or by the ranks induced by 
these numbers. 

We call the chronological sequence of samples resulting 
from these updating operations a “panel” and the set of 
updating operations “maintenance” of the panel. 

The maintenance scheme presented in this article is 
analogous to that of Hidiroglou, Choudhry and Lavallée 
(1991). It corresponds to a frequency of updating of the 
stratification and probabilities which is significantly lower 
than the survey frequency. This is generally the case for 
surveys with a sub-annual periodicity. The speed of 
demographic movements is not considered large enough to 
make it worthwhile to reselect the sample every time. The 
rotation is carried out without changing the probabilities of 
inclusion and the strata between two resamplings and it is 
regularly spread over time to conserve a certain continuity 
of the quality of the estimators of change over time. This 
also corresponds to a duration of inclusion of which the 
expected value is constant. In certain algorithms, we could 
determine a constant duration between two resamplings; 
otherwise we could set an upper limit. The speed of 
rotation represents a compromise between the efficiency of 
the estimators of change over time, which is greater the 
lower the rate of renewal, and the concern not to keep a unit 
in the panel for too long. Note that the quest for maximum 
coverage in the resampling remains meaningful with the 
rotation: we first remove the fraction to be renewed as if 
there were no resampling, then we seek the maximum 
coverage with the residual portion. 

We will examine several methods of panel maintenance, 
with emphasis on maximizing sample coverage during 
resamplings. We will distinguish more particularly a 
process which assigns equidistant numbers to the units 
before each change of stratum. 


The article is divided as follows: 


After reviewing definitions and describing a few 
notations in section 2, we briefly indicate in section 3 how 
Poisson sampling makes it possible to carry out the 
previous maintenance scheme simply and perfectly. This 
sampling has the disadvantage of being of random size, but 
it serves as a reference for the stratified sampling of fixed 
size which we then consider. 


In most instances, in these samplings, the probabilities 
of inclusion are determined at the outset and rounding is 
used to obtain an integer sample size in each stratum. 
This problem, examined in section 4, is not negligible when 
the strata are small, which can occur for strata of births. In 
addition, rounding is used in the method which we propose 
to maximize the coverage after resampling. 

Section 5 deals with the maximum coverage of samples 
of fixed size. First, we review two known methods: that of 
Kish and Scott (1971) and another based on the attribution 
to each unit of permanent independent numbers following 
the uniform distribution. The Kish and Scott method (1971) 
seems poorly suited to an intermediate rotation between 
resamplings. The other method, which reproduces simple 
random sampling in each stratum, does not have this 
disadvantage, but the coverage is less than with the Kish 
and Scott method (1971). Finally, we propose that the 
numbers be equidistant before resampling. We then obtain 
the same coverage as with the Kish and Scott method 
(1971), at least in the case of proportional distribution, 
while facilitating intermediate rotations. However, the 
coverage remains less than the maximum theoretical coverage 
which we obtain, for example, with Poisson sampling. 


In sections 6 and 7, we present the intermediate phases 
of updating births and deaths and of rotation. 

To conclude the topic of maintenance, we show in 
section 8 how resampling can take place between two 
phases of rotation. We present a type of algorithm which 
extends after resampling the rotation before resampling. It 
is based on transformations of the random numbers used in 
the sampling, so as to return to resampling without rotation. 
These transformations are particularly simple when they 
involve equidistant numbers, but can also be carried out 
with the uniform beginning numbers if we wish to continue 
with simple random sampling. 


2. REMINDERS, DEFINITIONS AND 
NOTATIONS 


Let there be a population, or finite set of units 
$i \in U = \{1, \ldots, N\}$, where $N$ is the size of the population. 

We consider only samples without replacement. A 
sample is then simply a subset $s$ of $U$. We call sample size 
the number $n$ of units which it contains. 

A sampling or selection plan is a discrete probability 
p(s) over the set of samples. 

We can generalize to joint sampling of several samples. 
Limiting ourselves to two samples $s_1, s_2$, the joint 
sampling is the probability $p(s_1, s_2)$ over the set of pairs 
$(s_1, s_2)$. 

The first-order probability of inclusion of an individual $i$ is 
defined by: 

$$\pi_i = \sum_{s \ni i} p(s).$$ 

E(.) being the expected value with respect to the sampling, 
this yields: 

$$E(n) = \sum_{i \in U} \pi_i.$$ 


In the case of two samples with first-order probabilities 
of inclusion $\pi_{i,1}, \pi_{i,2}$, we can define the joint probability of 
inclusion: 

$$\pi_{i,1,2} = \sum_{s_1 \ni i,\; s_2 \ni i} p(s_1, s_2).$$ 

This yields the constraint: 

$$\pi_{i,1,2} \le \min(\pi_{i,1}, \pi_{i,2}). \qquad (2.1)$$ 

If $i \in s_1$, the probability of reselection in $s_2$ is 
$\pi_{i,1,2}/\pi_{i,1} \le \min(1, \pi_{i,2}/\pi_{i,1})$. 

In Poisson sampling, the selection of the units is 
independent and the sample size is random. Except in 
section 3, we will instead consider sampling where the size 
is fixed up to rounding. 

Simple random sampling (SRS) is sampling of fixed size 
where the samples are equiprobable. This yields $\pi_i = n/N$. 

The population is partitioned into strata $U_h$, $h = 1, \ldots, H$, 
of sizes $N_h$. In this article, we will call a set of $H$ 
independent samples of fixed size $n_h$ in each stratum 
“stratified sampling of fixed size” and we will limit 
ourselves to samplings with a uniform first-order probability 
of inclusion in each stratum. We will then use the 
notation $f_h = \pi_i$. We will call a stratified sampling of fixed 
size with simple random sampling in each stratum 
“stratified simple random sampling” (SSRS). 

We will call the number of consecutive surveys where a 
unit is included in the panel “duration of inclusion of a 
unit.” We will denote it $D_i$, or $D_h$ in the particular case 
where it is the same for all units of a stratum $h$. When 
$\pi_i \ge 0.5$, this duration cannot be less than $\pi_i/(1 - \pi_i)$. For 
example, if $\pi_i = 0.7$, the duration of inclusion is at least 3. 
In practice, we will not rotate units whose $\pi_i$ exceeds a 
certain threshold. 

In addition, the previous variables are indexed by survey 
wave $t$. The population $U_t$ of size $N_t$ and the sample $s_t$ of 
size $n_t$ vary because of births and deaths, and the sample 
also varies as a result of the stipulated rotation. Moreover, 
we will consider samples at particular times $t = t_1$ of the 
first sampling and $t = t_2$ of the first resampling. For the 
sake of simplicity, they will be denoted $s_1, s_2$ instead of 
$s_{t_1}, s_{t_2}$. The algorithms described for the pair $(s_1, s_2)$ will be 
valid for the following resampling pairs. 


3. SOLUTION BY POISSON SAMPLING 


It is enlightening to examine how the 
panel maintenance scheme can be implemented by Poisson sampling. This is 
the model which we will endeavour to approximate in order 
to choose a selection method. 

We attribute to each unit $i$, at its birth, a 
random number $\omega_i$ selected according to the uniform 
distribution in [0,1). It is implicit in the formulae where 
these numbers appear that the results of the operations are 
taken modulo 1. 

During the first sampling, at date $t = t_1$, we select the 
units such that $\omega_i$ belongs to the interval $[0, \pi_{i,1})$, where $\pi_{i,1}$ 
are the given probabilities of inclusion. In the absence of 
rotation, we keep this interval at the following dates until 
resampling. Births as well as deaths are distributed at 
random in this interval. The resampling, at date $t = t_2$, is 
carried out by selecting the units of the interval $[0, \pi_{i,2})$, 
where $\pi_{i,2}$ are new probabilities of inclusion. The joint 
probability of inclusion is equal to the length of the 
common interval, i.e., $\min(\pi_{i,1}, \pi_{i,2})$, which is the maximum 
theoretically possible according to formula (2.1). The 
expected value of the coverage is therefore itself maximal. 
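
The sampling and resampling without rotation can be illustrated with a short simulation. This is a sketch only: the population size and the two sets of inclusion probabilities are made-up values, and the function name `poisson_sample` is hypothetical.

```python
import numpy as np

def poisson_sample(omega, pi):
    """Select the units whose permanent random number lies in [0, pi)."""
    return omega < pi

rng = np.random.default_rng(1)
N = 100_000
omega = rng.random(N)                      # permanent numbers, uniform on [0, 1)

# Illustrative inclusion probabilities at the sampling and the resampling.
pi1 = rng.uniform(0.05, 0.5, N)
pi2 = np.clip(pi1 * rng.uniform(0.5, 1.5, N), 0.0, 1.0)

s1 = poisson_sample(omega, pi1)            # sample at the first sampling
s2 = poisson_sample(omega, pi2)            # sample at the resampling

overlap = int(np.sum(s1 & s2))             # realized coverage
expected = float(np.sum(np.minimum(pi1, pi2)))  # sum of the upper bounds (2.1)
```

Because both samples use the same number for each unit, a unit is in both samples exactly when its number is below the smaller of its two probabilities, so `overlap` fluctuates around `expected`.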

Let us now consider a rotation between the sampling and 
the resampling. We maintain the probability $\pi_{i,1}$ and we can 
determine a duration of inclusion $D_{i,1}$ which is variable 
depending on the units, but fixed until the resampling. This 
constraint is realized by defining the sample at date 
$t$ ($t_1 < t < t_2$) by the interval 

$$\omega_i \in [(t - t_1)\pi_{i,1}/D_{i,1},\; (t - t_1)\pi_{i,1}/D_{i,1} + \pi_{i,1}).$$ 

The rate of rotation is a random variable. Its expected 
value results from $D_{i,1}$. It is equal, for any subset $V$ of the 
population, to $\sum_{i \in V} (\pi_{i,1}/D_{i,1}) / \sum_{i \in V} \pi_{i,1}$. 
At the first resampling at date $t = t_2$, we could define the 
sample by 

$$\omega_i \in [(t_2 - t_1)\pi_{i,1}/D_{i,1},\; (t_2 - t_1)\pi_{i,1}/D_{i,1} + \pi_{i,2}).$$ 

However, we encounter a difficulty for units such that 

$$\pi_{i,2} < \pi_{i,1}\left(1 - \frac{1}{D_{i,1}}\right),$$ 

and if $\omega_i$ belongs to the interval 

$$\left[(t_2 - t_1)\pi_{i,1}/D_{i,1} + \pi_{i,2},\; (t_2 - t_1)\pi_{i,1}/D_{i,1} + \pi_{i,1}\left(1 - \frac{1}{D_{i,1}}\right)\right).$$ 

These units, which were previously in the sample, are 
excluded but will be reincluded in a future rotation. If we 
wish to avoid this, we must make the upper limit of the new 
interval coincide with that of the old interval, and the 
sample at date $t = t_2$ is finally defined by: 

$$\omega_i \in [a_{i,1},\; a_{i,1} + \pi_{i,2}),$$ 

where: 

$$a_{i,1} = (t_2 - t_1)\pi_{i,1}/D_{i,1} + \max\left(0,\; \pi_{i,1}\left(1 - \frac{1}{D_{i,1}}\right) - \pi_{i,2}\right).$$ 

The joint probability of inclusion is equal to the length of 
the common interval, i.e., 

$$\min\left(\pi_{i,1}\left(1 - \frac{1}{D_{i,1}}\right),\; \pi_{i,2}\right).$$ 

This is also the maximum compatible with the rotation. 
If we continue the rotation with durations of inclusion 
$D_{i,2}$, the interval at date $t > t_2$ is: 

$$[a_{i,1} + (t - t_2)\pi_{i,2}/D_{i,2},\; a_{i,1} + (t - t_2)\pi_{i,2}/D_{i,2} + \pi_{i,2}).$$ 

Poisson sampling controls exactly the duration of 
inclusion and maximizes, as an expected value, the 
coverage during resampling but with the disadvantage of a 
random sample size, regardless of the subpopulation. In the 
following pages, we will endeavour to devise algorithms 
similar to those just described for Poisson sampling in order 
to apply them to stratified sampling of fixed size. We will 
try to control the duration of inclusion in the rotation, as for 
Poisson sampling, and to approximate the same rate of 
coverage during resampling. We will begin with the 
problem of coverage during resampling in section 5, but 
first, it is useful to clarify certain concepts concerning the 
rounding of sample sizes by stratum. 
4. ROUNDING OF SAMPLE SIZES 
BY STRATUM 


This problem is related to the estimation formulae. These 
formulae use the first-order probabilities of inclusion, either 
in the unbiased Horvitz-Thompson estimator or in adjusted 
estimators. Let $f_h$ be the probability of inclusion by 
stratum, and let $\nu_h = N_h f_h$. We must have an integer 
$n_h$ per stratum. An initial method for accomplishing this 
consists in restricting the choice of the $f_h$ in such a way 
that $\nu_h$ is an integer. In each stratum where we would have 
had $\nu_h < 1$, we must take $\nu_h = 1$ so that $f_h > 0$. However, 
if the stratification is very fine vis-à-vis the sample size, this 
occurs in numerous strata. This makes it necessary either 
to increase the sample size or to decrease the sampling rate 
in the other strata, to the detriment of efficiency. 

We will use a second method, which consists in linking 
the probability $f_h$ more loosely to $n_h$. We apply a rounding 
process such that $E(n_h) = \nu_h$, where $\nu_h$ is no longer 
necessarily an integer. 

Let us assume that $I(\cdot)$ is the integer part function. We 
must have 

$$\Pr(n_h = I(\nu_h) + 1) = q_h,$$
$$\Pr(n_h = I(\nu_h)) = 1 - q_h,$$

where $q_h = \nu_h - I(\nu_h)$. 

It is then no longer necessary that $n_h > 0$ in order for 
$f_h > 0$. Note that the first method can be considered a 
particular case of the second. This rounding can be done 
independently by stratum, in a linked way by systematic 
rounding or by the Cox method (1987). We describe only 
systematic rounding. 

Let us first order all of the strata, and index them by their 
rank. Let $c_0 = 0$ and $c_h = \sum_{j \le h} q_j$; we select a number 
$\theta$ in the interval [0, 1), according to the uniform 
distribution, and we take $n_h = I(\nu_h) + 1$ in the strata such 
that $c_{h-1} \le m - 1 + \theta < c_h$ for some integer $m$, and 
$n_h = I(\nu_h)$ in the other strata. 

This implies that 

$$| (n_{j_1} + \cdots + n_{j_2}) - (\nu_{j_1} + \cdots + \nu_{j_2}) | < 1$$

for any $j_1, j_2$ such that $1 \le j_1 \le j_2 \le H$. 

In particular, the global size differs by less than one unit 
from its expected value. This is obviously not the case with 
independent roundings. 
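
The systematic rounding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `systematic_rounding` and the example values are hypothetical, and the sketch assumes the cumulated quantities are the fractional parts $q_h$ and that a stratum is rounded up when a point $m - 1 + \theta$ falls in $[c_{h-1}, c_h)$.

```python
import math
import random


def systematic_rounding(nu, theta=None, rng=random):
    """Round the expected sizes nu[h] to integers n[h] so that E(n[h]) = nu[h].

    nu    : expected sample sizes by stratum, in the chosen stratum order
    theta : number in [0, 1); drawn uniformly when not supplied
    """
    if theta is None:
        theta = rng.random()
    sizes = []
    c_prev = 0.0                          # c_{h-1}, cumulated fractional parts
    for v in nu:
        base = math.floor(v)              # I(nu_h)
        c = c_prev + (v - base)           # c_h = c_{h-1} + q_h
        k = math.ceil(c_prev - theta)     # smallest integer k with k + theta >= c_{h-1}
        rounded_up = (k + theta) < c      # a point k + theta lies in [c_{h-1}, c_h)
        sizes.append(base + 1 if rounded_up else base)
        c_prev = c
    return sizes


# Hypothetical expected sizes whose total is exactly 5.
example = systematic_rounding([0.5, 1.25, 2.25, 0.5, 0.5], theta=0.3)
```

Because the roundings are linked through the single number $\theta$, the total of the rounded sizes stays within one unit of the total of the $\nu_h$, whatever the value drawn.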


5. ALGORITHMS FOR THE MAXIMUM 
COVERAGE OF SAMPLES OF FIXED SIZE 


The maintenance algorithms which we propose are based 
on the attribution of equidistant numbers. This is not 
necessary during the first sampling, nor in the rotation, but 
is used to maximize the coverage during updates of the 
stratification. That is why we examine this maintenance 
phase first. 

Let us begin by describing all the notations and making 
a few useful observations. 

We select a first sample $s_1$ stratified according to 
criterion $h_1$. After a certain time has elapsed, we select a 
new sample $s_2$ with an updated stratification $h_2$. The 
first-order probabilities of inclusion are respectively 
$f_{h_1}$, $f_{h_2}$ and the sample sizes required by stratum are 
respectively $n_{h_1}$, $n_{h_2}$. It is sufficient to consider what 
happens in any stratum $h_2 = g$. Let $s_{g,1}$ be the part of the 
first sample $s_1$ in this new stratum, of which the size 
$n_{g,1}$ is generally random. Let $s_{g,2}$ be the part of the 
second sample $s_2$ in this new stratum, of which the size 
$n_{g,2}$ is fixed up to rounding. The size $n_{g,1,2}$ of the 
coverage cannot exceed the limit $\min(n_{g,1}, n_{g,2})$. We can 
hope to devise a resampling process with a uniform first-order 
probability of inclusion in $s_{g,2}$ which makes it possible to 
attain this limit, at least when the first-order probabilities of 
inclusion in $s_{g,1}$ are also equal to a single value 
$f_{h_1} = f_1$. Note that, even if this limit is attained, the 
fixed size constraints decrease the coverage. The finer the 
stratification, the greater this effect. In fact, the smaller the 
population of stratum $g$, the greater the likelihood that the 
coefficient of variation of $n_{g,1}$ will be large, as well as the 
proportion of units not reselected in the case $n_{g,1} > n_{g,2}$. 

There is an obvious way of attaining the limit 
$\min(n_{g,1}, n_{g,2})$. Let us assume first of all that the 
first-order probabilities of inclusion in $s_{g,1}$ are uniform. If 
$n_{g,1} < n_{g,2}$, we add $n_{g,2} - n_{g,1}$ units to $s_{g,1}$, 
selected at random in the complement of $s_{g,1}$. If 
$n_{g,1} > n_{g,2}$, we remove $n_{g,1} - n_{g,2}$ units from 
$s_{g,1}$, selected at random. By construction this yields 
$s_{g,1} \subseteq s_{g,2}$ or $s_{g,2} \subseteq s_{g,1}$, and 
$n_{g,1,2} = \min(n_{g,1}, n_{g,2})$. If the first-order 
probabilities of inclusion in $s_{g,1}$ are not uniform, we apply 
the same method within subsets where these probabilities 
are uniform. This is the method proposed by Kish and Scott 
(1971) on page 468 of their article. They do not stipulate the 
procedure for random selection, but we assume that it is SRS. 
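
The add-or-remove step just described can be sketched as below for a single new stratum with uniform first-sample probabilities. This is only an illustration under stated assumptions: the function name `kish_scott_resample` and its arguments are hypothetical, and the random selection is taken to be simple random sampling without replacement, as assumed in the text.

```python
import random


def kish_scott_resample(stratum_units, old_sample, n_new, rng=random):
    """Select n_new units in a new stratum while keeping as many old units as possible.

    stratum_units : list of the unit identifiers in the new stratum g
    old_sample    : set of identifiers belonging to the first sample
    n_new         : required size of the second sample in g
    """
    kept = [u for u in stratum_units if u in old_sample]       # s_{g,1}
    if len(kept) >= n_new:
        # Too many old units: remove the surplus at random.
        return set(rng.sample(kept, n_new))
    # Too few old units: add the shortfall at random from the complement.
    complement = [u for u in stratum_units if u not in old_sample]
    return set(kept) | set(rng.sample(complement, n_new - len(kept)))


# Hypothetical stratum of 10 units, 3 of which were in the first sample.
units = list(range(10))
first = {1, 4, 7}
second = kish_scott_resample(units, first, 5, rng=random.Random(0))
```

Whatever the random draw, the overlap equals the smaller of the two sample sizes in the stratum.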

As Kish and Scott point out, the second-order 
probabilities of inclusion are not uniform and if the first 
sampling is a SSRS, the second sampling no longer meets 
this definition. The first-order probability of inclusion 
itself is not strictly uniform when $g$ includes elements of 
several strata from the previous sampling: see an example in 
the appendix. However, there is another method which satisfies 
this condition. It is well known to statistical agencies which 
practise coordination of samples. For the sake of 
convenience, we will call it “method 1”. 


Method 1: 
Use of independent numbers following 
the uniform distribution 


We attribute to the units, at their birth, numbers $\omega_i$ 
which follow the uniform distribution in [0, 1) and are 
independent, as in Poisson sampling. The first sample $s_1$ is 
obtained by selecting, for example, the $n_{h_1}$ units of lower 
rank according to $\omega_i$ in each stratum $h_1$. With this 
algorithm, the maximum coverage is also obtained by selecting 
the $n_{h_2}$ units of lower rank according to $\omega_i$ in each 
stratum $h_2$. 
Moreover, it is obvious that these two samplings are SSRS. 
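
As an illustration of method 1, the sketch below selects, in each stratum, the units with the smallest permanent numbers, first under the old stratification and then under the new one. The helper `select_lowest`, the unit labels and the numbers are hypothetical; the data are deliberately contrived (not drawn at random) so that the loss of coverage is visible.

```python
def select_lowest(number, stratum, sizes):
    """In each stratum, select the sizes[h] units with the smallest numbers."""
    sample = set()
    for h, n in sizes.items():
        members = sorted((u for u in number if stratum[u] == h), key=number.get)
        sample.update(members[:n])
    return sample


# Hypothetical permanent numbers, assigned once at birth.
omega = {"u1": 0.1, "u2": 0.2, "u3": 0.3, "u4": 0.4,
         "u5": 0.5, "u6": 0.6, "u7": 0.7, "u8": 0.8}
# Old stratification (two strata) and new stratification (one stratum "g").
h1 = {"u1": 1, "u2": 1, "u3": 1, "u4": 1, "u5": 2, "u6": 2, "u7": 2, "u8": 2}
h2 = {u: "g" for u in omega}

s1 = select_lowest(omega, h1, {1: 2, 2: 2})   # first sample, 2 units per old stratum
s2 = select_lowest(omega, h2, {"g": 4})       # second sample, 4 units in g
overlap = len(s1 & s2)
```

In this contrived case both samples have 4 units but only 2 are common, because the largest selected numbers differ widely between the two old strata.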

It is also obvious that we cannot obtain greater coverage 
with this algorithm. In addition, we conjecture that it is not 
possible to do better, for SSRS, regardless of the algorithm. 

On the other hand, the coverage is poorer as an expected 
value than with the Kish and Scott method (1971), at least 
in the particular case where the first-order probabilities of 
inclusion in $s_1$ are uniform. In fact, at that point the 
relations $s_{g,1} \subseteq s_{g,2}$ or $s_{g,2} \subseteq s_{g,1}$, 
and $n_{g,1,2} = \min(n_{g,1}, n_{g,2})$, are not 
necessarily true and the loss of coverage is greater, the 
smaller the strata during the first sampling. 

We shall demonstrate this, again in the particular case of 
a uniform probability of inclusion $f_1$ in $s_1$. Let us assume 
that $\omega_{h_1}$ is the greatest value of $\omega_i$ for the units 
of $s_1$ in stratum $h_1$, and $\omega_g$ the greatest value of 
$\omega_i$ for the units of $s_2$ in stratum $g$. Let 
$\omega_{\min} = \min_{h_1}(\omega_{h_1})$ and 
$\omega_{\max} = \max_{h_1}(\omega_{h_1})$. If 
$\omega_g \le \omega_{\min}$, then $s_{g,2} \subseteq s_{g,1}$, and if 
$\omega_g \ge \omega_{\max}$, then $s_{g,2} \supseteq s_{g,1}$. In 
both cases $n_{g,1,2} = \min(n_{g,1}, n_{g,2})$. The risk of not 
attaining the limit exists only if 
$\omega_{\min} < \omega_g < \omega_{\max}$. In this case, the relation 
$s_{g,2} \subseteq s_{g,1}$ or $s_{g,2} \supseteq s_{g,1}$ is no longer 
necessarily true: see Figure 1, where we considered only two 
strata $h_1$. The loss of coverage is greater where the quantity 
$\omega_{\max} - \omega_{\min}$ is greater as 
an expected value, and therefore where the strata $h_1$ are 
smaller. 


Method 2: 
Use of equidistant numbers 


If we accept not to conserve a SSRS, how can we modify 
the previous method to obtain the same coverage as the Kish 
and Scott method (1971), at least when we have the uniform 
probability of inclusion $f_1$ in $s_1$? We have seen that the loss 
of coverage was the result of the deviation between the 
$\omega_{h_1}$. It is sufficient to transform the $\omega_i$ into new 
numbers $\rho_{i,1}$, in such a way that the $\rho_{h_1}$ which 
correspond to the $\omega_{h_1}$ are as close as possible to a 
common value, i.e., $f_1$. More specifically, we would like to 
have the equivalence: 

$$i \in s_1 \iff R_{h_1}(i) \le n_{h_1} \iff \rho_{i,1} \in [0, f_{h_1}),$$

where $R_{h_1}(i)$ is the rank according to $\omega_i$ in $h_1$ of 
unit $i$. A solution is given by the transformation: 

$$\rho_{i,1} = \frac{R_{h_1}(i) - 1 + \theta_{h_1}}{N_{h_1}} \qquad (5.1)$$

where $\theta_{h_1}$ is a real number which verifies: 

$$\theta_{h_1} \in [0, q_{h_1}) \quad \text{if } n_{h_1} = I(\nu_{h_1}) + 1,$$
$$\theta_{h_1} \in [q_{h_1}, 1) \quad \text{if } n_{h_1} = I(\nu_{h_1}).$$

The transformation therefore involves the rounded 
number of the $\nu_{h_1}$ examined in section 4. The sampling of 
$s_2$ is carried out like that of $s_1$ except that the $\rho_{i,1}$ 
now play the role of the $\omega_i$: in each new stratum $g$ we 
define rounded sizes $n_{g,2}$ and we select the $n_{g,2}$ units of 
lower rank according to $\rho_{i,1}$. Note that these ranks are 
different from those induced by $\omega_i$. 
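
The transformation into equidistant numbers can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `equidistant_numbers`, the unit labels, the numbers and the two values of $\theta$ are hypothetical, and each $\theta$ is assumed to have been chosen consistently with the rounding of its stratum (here $\nu = 2$ is an integer, so any value in [0, 1) is admissible).

```python
def equidistant_numbers(omega, stratum, theta):
    """Return rho_i = (R_h(i) - 1 + theta_h) / N_h, R_h(i) being the rank of omega_i in h."""
    rho = {}
    for h in set(stratum.values()):
        members = sorted((u for u in omega if stratum[u] == h), key=omega.get)
        for rank, u in enumerate(members, start=1):
            rho[u] = (rank - 1 + theta[h]) / len(members)
    return rho


# Hypothetical data: two old strata of 4 units, sampling rate f = 0.5 in both.
omega = {"u1": 0.1, "u2": 0.2, "u3": 0.3, "u4": 0.4,
         "u5": 0.5, "u6": 0.6, "u7": 0.7, "u8": 0.8}
h1 = {"u1": 1, "u2": 1, "u3": 1, "u4": 1, "u5": 2, "u6": 2, "u7": 2, "u8": 2}
rho = equidistant_numbers(omega, h1, theta={1: 0.2, 2: 0.7})

f = 0.5
s1 = {u for u in rho if rho[u] < f}          # first sample: rho in [0, f)
s2 = set(sorted(rho, key=rho.get)[:4])       # second sample: 4 lowest rho in new stratum g
overlap = len(s1 & s2)
```

With these numbers the first sample is the same as the one that would be obtained by taking the two lowest $\omega_i$ in each old stratum, yet the four lowest $\rho$ in the merged stratum are exactly those four units, so the coverage reaches its limit.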

Let us assume that the probability of inclusion in $s_1$ is 
still uniform. Let $\rho_g$ be the value of $\rho_{i,1}$ for the unit 
of rank $n_{g,2}$ in $g$. If $\rho_g \in [0, f_1)$, then 
$s_{g,2} \subseteq s_{g,1}$. Otherwise $s_{g,2} \supseteq s_{g,1}$. In 
this particular case, we therefore attain the maximum coverage 
$n_{g,1,2}$ as in the Kish and Scott method (1971), and unlike 
method 1. We illustrate in Figures 1 and 2 how the 
transformation into equidistant numbers makes it possible to 
increase the coverage compared to method 1. 

We apply the same algorithm when the probabilities of 
inclusion in $s_1$ are not uniform. Unlike the Kish and Scott 
method (1971), we do not need to fix the size of the new 
sample within subsets where these probabilities are 
uniform. This is another advantage and we think that it 
increases the coverage. 

Nonetheless, the coverage obtained by this algorithm 
remains lower, as an expected value, than that of a Poisson 
sampling with the same probabilities of inclusion. In order 
to have, as an expected value, the same coverage as with 
a Poisson sampling, it would be sufficient to define $s_{g,2}$ by 
$\rho_{i,1} \in [0, f_g)$. In fact, we would then have 
$\Pr(i \in s_1 \cap s_2) = \min(f_{h_1}, f_g)$, but the sampling so 
obtained would no longer be of fixed size. 

The following resamplings, after new updates, are 
carried out by repeating the process. For example, before 
selecting $s_3$ we calculate equidistant numbers $\rho_{i,2}$ based 
on $\rho_{i,1}$ (and not $\omega_i$) in each stratum $h_2$. 

The resulting sampling plan in the new strata is no longer 
a SRS. In particular, the probabilities of inclusion of the 
pairs of units vary generally as a function of the former 
strata. In other words, the resampling keeps a “trace” of the 
stratification of the first sampling. Moreover, the 
probabilities of inclusion of the units in $s_{g,2}$ are not exactly 
equal to $f_g$ except for the sample defined by 
$\rho_{i,1} \in [0, f_g)$. For the sample of fixed size $n_{g,2}$, this 
probability varies as a function of the size of the former 
strata. As in the Kish and Scott method (1971), we do not 
strictly control these probabilities. However, the deviation 
between $f_g$ and the true probability becomes negligible 
when $n_{g,2}$ is sufficiently large. 


Note 1. The transformation of numbers which inde- 
pendently follow the uniform distribution in equidistant 
numbers was proposed by Brewer, Early and Hanif (1984) 
as a way of rotating samples in the same manner as Poisson 
sampling, with the advantage of a smaller variance of the 
sample size. However, this transformation is performed by 
taking the set of the population, and therefore they did not 
address the problem of maximum coverage during changes 
of stratum. The numbers change only when births and 
deaths are updated, according to a procedure which is also 
quite different from that which we propose for changes of 
stratum. 


Note 2. In the demonstration we just provided, it is not 
necessary that the numbers be completely equidistant. It is 
sufficient that the $n_{h_1}$ units of $s_1$ and the 
$N_{h_1} - n_{h_1}$ complementary units have their new numbers 
respectively in $[0, f_{h_1})$ and $[f_{h_1}, 1)$. We could attribute 
these new numbers in such a way that they independently follow 
the uniform distribution in these intervals. 


Figure 1. Coverage with method 1 (numbers following the uniform 
distribution). 


We have represented the units in $g$ according to the value of the 
number $\omega$ (on the abscissa) and the stratum $h_1$ of the first sampling 
(on the ordinate). We assume that there are only two strata. The 
circles correspond to $s_{g,1}$ and the squares to the complementary part. 
The solids correspond to $s_{g,2}$ and the blanks to the complementary 
part. The size of $s_{g,2}$ was fixed at 9, which defines $\omega_g$. In this 
example, we see that two units are not reselected (in $h_1 = 1$) and that 
another is new (in $h_1 = 2$). The size of the coverage is 8, while the 
Kish and Scott method would make it possible to reselect the 9 units 
in $s_{g,2}$. 


Figure 2. Coverage with method 2 (equidistant numbers). 


We are in the same situation as in Figure 1, but this time the 
equidistant numbers $\rho$ serve as the abscissa of the units. This 
equidistance is defined in each of the whole strata $h_1$ and the gaps we 
see in the sequence of numbers correspond to the units which are not 
in $g$. The first sample $s_{g,1}$ is composed of the units for which this 
number is less than the probability of inclusion $f_1$, regardless of the 
stratum. The second sample $s_{g,2}$ is composed of the 9 units with the 
smallest $\rho$ and the coverage is 9, as with the Kish and Scott method 
(1971). 


6. UPDATING BIRTHS AND DEATHS 
WITHIN STRATA 


In this section and the following one, we consider the 
stratification ($h$) without reference to the period. The 
updating of births and deaths within strata is essentially a 
particular case of change of the strata of units. It is exactly 
as if the births entered the strata and the deaths left. We can 
therefore apply the previous methods. Let us take a look, in 
particular, at method 2. 

In a stratum, the population $U_{h,t}$ of size $N_{h,t}$ varies with 
each updating carried out at time $t$. We will denote the 
births as $B_{h,t+1}$ and the deaths as $D_{h,t+1}$ between $t$ and 
$t + 1$; this yields 

$$U_{h,t+1} = (U_{h,t} \cup B_{h,t+1}) \setminus D_{h,t+1}.$$


We consider the simple case where the probabilities of 
inclusion $f_h$ remain uniform in $U_{h,t}$ and constant. The size 
$n_{h,t}$ of the sample $s_{h,t}$ is a rounded number to the integer 
of $N_{h,t} f_h$. The numbers $\rho_{i,t}$ change with each updating. 
Just before updating $s_{h,t}$, leading to $s_{h,t+1}$: 
a) we make the numbers $\rho_{i,t}$ equidistant in $U_{h,t}$; 
b) we attribute equidistant numbers to the units of $B_{h,t+1}$. 

Let $\rho_{i,t+1}$ be the number so obtained. An initial solution 
would consist in selecting the $n_{h,t+1}$ units of $U_{h,t+1}$ with 
the smallest $\rho_{i,t+1}$. Note that these are no longer 
equidistant because we removed the deaths situated at random. 

However, units with numbers close to $f_h$ can leave the 
sample and then return on a future occasion. We remedy 
this by a rightward shift of the selection interval. Let 
$\rho_{h,d}$ be the number of the beginning unit of the selection 
interval for $s_{h,t}$ and $\rho_{h,e}$ that of the unit immediately 
following the end unit of this interval in $U_{h,t}$. In other 
words, the sample $s_{h,t}$ consists of the interval closed to the 
left and open to the right $[\rho_{h,d}, \rho_{h,e})$. Between $t$ and 
$t + 1$, the number of units of $U_{h,t+1}$ belonging to this 
interval becomes $m_{h,t+1}$. If $n_{h,t+1} \ge m_{h,t+1}$, the 
beginning of the interval for $s_{h,t+1}$ is fixed to the unit of 
number $\rho_{h,d}$; otherwise we shift the interval in such a way 
that its end is the unit of number $\rho_{h,e}$. We therefore have 
a slight involuntary rotation. 
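
The choice between keeping the beginning of the interval and keeping its end can be sketched as below. This is a simplified illustration under stated assumptions: the function name `update_sample` and the data are hypothetical, the dictionary already contains only the surviving units and the births with their new numbers, and wrap-around past 1 is ignored.

```python
def update_sample(rho, start, end, n_new):
    """Select n_new units after an update of births and deaths.

    rho        : dict unit -> number, for the units of the updated population
    start, end : bounds of the previous selection interval [start, end)
    n_new      : required sample size after the update
    """
    ordered = sorted(rho, key=rho.get)
    inside = [u for u in ordered if start <= rho[u] < end]
    m = len(inside)                      # units of the new population in the old interval
    if n_new >= m:
        # Keep the beginning of the interval and extend it to the right.
        from_start = [u for u in ordered if rho[u] >= start]
        return from_start[:n_new]
    # Fewer units needed: keep the end of the interval and drop units on the left.
    return inside[m - n_new:]


# Hypothetical numbers for six units; the previous interval was [0.1, 0.4).
rho = {"a": 0.05, "b": 0.15, "c": 0.25, "d": 0.35, "e": 0.45, "f": 0.55}
grown = update_sample(rho, 0.1, 0.4, 4)
shrunk = update_sample(rho, 0.1, 0.4, 2)
```

When the sample must shrink, the units dropped are those at the left of the interval, so a unit that leaves does not come back as the interval later moves to the right.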


7. ROTATION BETWEEN TWO RESAMPLINGS 


7.1 Rotation Without Updating of Births and Deaths 


We can then stipulate a duration of inclusion $D_h$ that is a 
whole number and constant in the stratum. We have two variants, depending 
on whether we keep the same rounded number or vary it. 


7.1.1 Fixed Rounded Number 


We therefore have a size $n_h$ strictly fixed during the 
rotation. We divide $n_h$ into $D_h$ whole numbers $n_{h,j}$ 
($j = 1, \ldots, D_h$) such that $|n_{h,j} - n_h/D_h| < 1$. Let $q_t$ 
be the quotient and $r_t$ the remainder of the division of 
$t - t_1$ by $D_h$, and let $n_{h,0} = 0$. The sample at time $t$ 
includes the units ranging from rank 
$1 + q_t n_h + \sum_{j=0}^{r_t} n_{h,j}$ to rank 
$(q_t + 1) n_h + \sum_{j=0}^{r_t} n_{h,j}$. 

If $D_h = D$, we can stipulate in addition 


$$\Big| \sum_h n_{h,j} - \sum_h n_h / D \Big| < 1.$$


The variance of the rate of rotation is then practically nil. 
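
The rank window of section 7.1.1 can be computed as in the sketch below. The function name `rotation_ranks` and the example values are hypothetical, and the particular split of the sample size into whole numbers (larger parts first) is only one admissible choice; ranks are returned before any wrap-around at the end of the stratum.

```python
def rotation_ranks(n_h, d_h, t, t1=0):
    """Return the first and last rank of the sample at time t with a fixed size n_h.

    n_h : fixed sample size in the stratum
    d_h : duration of inclusion (whole number of periods)
    t1  : date of the first sampling
    """
    base, extra = divmod(n_h, d_h)
    # n_{h,1}, ..., n_{h,D_h}: whole numbers within one unit of n_h / D_h, summing to n_h.
    parts = [base + 1 if j < extra else base for j in range(d_h)]
    q, r = divmod(t - t1, d_h)
    # With n_{h,0} = 0, the sum for j = 0..r reduces to the first r parts.
    shift = q * n_h + sum(parts[:r])
    return 1 + shift, n_h + shift


# Hypothetical stratum: 5 sampled units, each kept for 3 periods.
windows = [rotation_ranks(5, 3, t) for t in range(5)]
```

After $D_h$ periods the window has moved by exactly $n_h$ ranks, so every unit of the initial sample has been replaced.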

However, the duration of inclusion is not controlled 
when $\nu_h < 1$: this yields $n_h = 0$ or $n_h = 1$. In the first case, 
there is no rotation, and in the second case, on the contrary, 
the time of exclusion can be considered too short. The 
following method makes it possible to obtain a rotation 
which corresponds to $\nu_h$. 


7.1.2 Variable Rounded Number 


The sample $s_{h,t}$ is defined based on the numbers 
rendered equidistant: 

The sample size varies between $I(\nu_h)$ and $I(\nu_h) + 1$ in 
the stratum, and it is independent of the sizes in the other 
strata. This shows us what the result would be of the 
sample rotation advocated by Brewer et al. (1984) in the 
case of stratified sampling of fixed size and uniform 
probability in each stratum. 


7.2 Rotation With Updating of Births and Deaths 


To simplify, we assume that each new survey wave is 
accompanied by the introduction of the births since the 
previous wave and a rotation. The method bifurcates into 
two procedures depending on whether or not we wish to 
respect exactly the durations of inclusion $D_h$ between two 
resamplings. 


7.2.1 Procedure A 


The births are isolated in separate strata, and we wait for 
the resampling before subtracting the deaths. In this case 
each wave of births is dealt with exactly like an initial 
sampling after attributing the numbers $\omega_i$. The sampling is 
carried out by stratifying with the same nomenclature ($h$), 
or with another, finer or coarser. To 
simplify the notations, but without loss of generality, we 
assume that this is the same nomenclature. The index of 
stratification can then be written $(b, h)$, where $b$, crossed 
with $h$, indicates the wave of births, with the particular 
value $b = 1$ corresponding to the units already existing 
during the first sampling or a previous resampling. This 
brings us back to the case of section 7.1 in each stratum 
$(b, h)$ and the duration of inclusion is respected exactly. 

The number of strata, and therefore of rounded numbers, 
is multiplied by the number of waves of births. The sample 
size can become fairly random with independent roundings 
(but less so than with Poisson sampling). It may therefore 
be worthwhile to link, at least partially, the rounded 
numbers. For example, we carry out a systematic rounding 
in the dimension $h$ for each $b$ or the reverse. We then keep 
these roundings, and it is the method of section 7.1.1 which 
then applies rather than that of section 7.1.2. 


7.2.2 Procedure B 


In procedure B, we subtract the deaths at each survey 
wave. This is the type of updating presented in section 6. 
We would prefer a fixed duration of inclusion, but that is 
made difficult by the random number of deaths. At most, 
we can try to control a maximum duration of inclusion 
$DM_h$. We may also wish to prevent the units which have 
just left the sample from returning on a future occasion, 
which can occur if the rotation is slow. The idea is to get 
back to the algorithm described in section 6 by removing 
first of all from $s_{h,t}$ the units of which the previous duration 
of inclusion in $s_{h,t}$ attained $DM_h$. They are found the 
farthest to the left of the interval $[\rho_{h,d}, \rho_{h,e})$ and are 
mixed with the births too recent to have attained $DM_h$. However, 
these must still be removed in order for the distribution of 
the sample according to the generations to be correct. For 
that, it is sufficient to attribute to the births a fictitious 
previous duration of inclusion which falls between 1 and 
$DM_h$, just after defining the sample. For example, after 
defining $s_{h,t}$, we assign to each unit of $B_{h,t}$ belonging to the 
sample the same previous duration of inclusion in the 
sample as that of the unit of $U_{h,t-1}$ situated immediately to 
the left. Then let $R_{h,t}$ be the highest rank among the ranks 
according to $\rho_{i,t}$ of the units of the interval associated with 
$s_{h,t}$ which have been included $DM_h$ times in the sample; we 
discard the first units of $s_{h,t}$ up to and including rank 
$R_{h,t}$. Finally, this brings us back to the algorithm described 
in section 6 with, for $\rho_{h,d}$, the number of the unit of rank 
$R_{h,t} + 1$, $\rho_{h,e}$ remaining that of the unit which follows 
the unit of last rank in $s_{h,t}$. 


8. RESAMPLING AFTER ROTATION 


We now return to the strata indices $h_1$, $h_2$. We define 
the stratification $h_1$ as a function of the procedure used for 
the updates of the births. With procedure A, we place the 
births in separate strata: this is the stratification defined by 
crossing the waves of births $b$ with the nomenclature $h_1$. 
With procedure B, $h_1$ is identical to $h$. However, we keep 
the notations of the quantities independent of $b$, such as 
$f_{h_1}$, $D_{h_1}$. 

The selection of the new sample $s_2$, in a new 
stratification $h_2$, must be carried out at period $t = t_2$. 

We begin by removing from the previous sample (at 
period $t = t_2 - 1$) the units which have attained the 
maximum authorized duration of inclusion. There remains 
a sample $s_1'$ of size $n_1'$, of which we would like to conserve 
the maximum number of units in the resampling. 

In the case without rotation examined in section 5, it was 
easy to define the resampling because the sample $s_1$ was 
composed of the units of lower rank according to $\omega_i$ in each 
stratum after a real number independent of the $\omega_i$. In this 
instance, this number is 0. The resampling took place in the 
same manner by selecting the units of lower rank according 
to $\rho_{i,1}$, after this number, in the new strata. 

After rotation this no longer works: there is no longer 
any real number independent of the numbers such that the 
sample $s_1'$ is composed of units of lower rank after it. This 
is true even in the case where $f_{h_1} = f_1$. The problem is 
obviously aggravated with $f_{h_1}$ varying by stratum. The idea 
which then comes to mind is to first carry out a transformation 
of the numbers in such a way that those from $s_1'$ find 
themselves at the beginning of [0, 1). This will then bring 
us back to the case without rotation. This is the same kind 
of idea which is presented by Hidiroglou, Choudhry and 
Lavallée (1991). 

This transformation is fairly immediate in the particular 
case where the updates are done with procedure A and with 
the variable rounded number from section 7.1.2. Without 
resampling, the selection interval at time $t_2$ would have 
been: 


$$\rho_{i,1} \in [(t_2 - t_1) f_{h_1}/D_{h_1},\ (t_2 - t_1) f_{h_1}/D_{h_1} + f_{h_1}).$$


The resampling results in new strata with probabilities 
$f_{h_2}$. These include the creations of units between the dates 
$t_2 - 1$ and $t_2$, to which we attribute equidistant numbers 
$\rho_{i,1}$ in each stratum $h_2$, independently of the survivors. 
They still contain units whose death has occurred since the 
previous sampling. It is possible to define a new sample $s_2$ 
in the same way as for Poisson sampling, by the interval, 
i.e., 


$$\rho_{i,1} \in [a_{h_1},\ a_{h_1} + f_{h_2})$$


where: 


$$a_{h_1} = (t_2 - t_1) f_{h_1}/D_{h_1} + \max(0, f_{h_1} - f_{h_2}).$$


Let us recall that we shift by the supplementary 
quantity $\max(0, f_{h_1} - f_{h_2})$ 
to prevent the units which have just left the sample from 
returning too quickly. 
As for Poisson sampling, the probability of a survivor 
being in the old and the new sample is then the maximum 
possible, namely: 

$$\min(f_{h_1}, f_{h_2}).$$

However the size $n_{h_2}'$ of this sample is random, whereas 
we want a sample of fixed size $n_{h_2}$. We obtain it by 
selecting, in each new stratum $h_2$, after having removed the 
deaths, the $n_{h_2}$ units of lower rank according to 
$(\rho_{i,1} - a_{h_1})$ modulo 1. This number therefore plays, for the 
resampling, the same role that $\omega_i$ played during the first 
sampling. 

If, on the other hand, we chose procedure A with a fixed 
rounded number in the rotation or if we chose procedure B, 
we must begin again with the rank of the units of $h_1$ during 
the last updating. This is the rank according to $\omega_i$ with 
procedure A or the rank according to $\rho_{i,t_2-1}$ with procedure 
B. Let us assume that $N_{h_1}$ is the size of the population at 
date $t_2 - 1$. Let $R_{h_1,d}$ be the rank of the unit preceding the 
one of lowest rank in $s_1'$ and $R_{h_1}(i)$ the rank of unit $i$. The 
number used to classify the unit in the new strata becomes: 


$$\Big( \frac{R_{h_1}(i) - 1 + \theta_{h_1}}{N_{h_1}} - a_{h_1} \Big) \ \text{modulo } 1,$$


where: 


$$a_{h_1} = R_{h_1,d}/N_{h_1} + \max(0,\ n_{h_1}'/N_{h_1} - f_{h_2}).$$


With procedure A we can keep the same $\theta_{h_1}$ as in (5.1), 
while we make a choice of $\theta_{h_1}$ consistent with the last 
rounded number if procedure B is applied. However, because of 
the rotation, this choice has a minor impact on the coverage and 
it would be almost as well to select it at random in [0, 1). 


9. CONCLUSION 


Algorithms based on equidistant numbers do not produce 
SRS. The first-order probabilities of inclusion are not 
exactly controlled and the second-order probabilities are 
unknown. During the changes of stratum, there remains a 
“trace” of the former strata in the new strata. The 
application of the SRS formulae to estimate the variance 
leads to biased results, generally in the direction of 
over-estimation. However, we think that the improvement 
in coverage during resamplings provided by the algorithms 
based on equidistant numbers outweighs the disadvantage 
of biased estimation of the variance and of the confidence 
intervals. According to section 5, the finer the stratification 
the greater this advantage. In particular, the use of 
equidistant numbers seems particularly appropriate with 
procedure A, where the strata $(b, h)$ are likely to be very 
small for the waves of births ($b > 1$). The advantage of 
equidistant numbers is not as great with procedure B. 
However, making the numbers of births equidistant renders 
both the number of survivors reselected at each updating of 
the sample and the duration of inclusion less random. 

However, let us take a quick look at what would change 
in the maintenance if we wanted to conserve SSRS. At 
each stage we must conserve the independent and uniform 
distribution of the $\omega_i$. First of all, the phases of updating 
the births and of rotation between resamplings described in 
sections 6 and 7 apply while still conserving the same $\omega_i$, 
and the procedure is even simpler. The most delicate part 
is the resampling after the intermediate phase of rotation. 
The objective is to obtain not only a SSRS but also, if 
possible, the same coverage as for method 1 in section 5. 

Let us assume that $\alpha_{h_1}(j)$ is the number $\omega$ of the unit of 
rank $j$ in a former stratum $h_1$. 

Let us assume first of all that, in a former stratum, all the 
units are such that $f_{h_2} \ge n_{h_1}'/N_{h_1}$. In particular, this 
occurs in all the strata for a sampling with a single rate in the 
sampled part, if we do not lower this rate. We then 
endeavour to find a transformation such that the numbers of 
the units of the sample are at the beginning of [0, 1). The 
simplest is the permutation: 


$$\beta_{h_1}(j) = \alpha_{h_1}(j + N_{h_1} - R_{h_1,d}), \quad j \le R_{h_1,d},$$


$$\beta_{h_1}(j) = \alpha_{h_1}(j - R_{h_1,d}), \quad j > R_{h_1,d}.$$

However, a less costly transformation is: 

$$\beta_{h_1}(j) = \alpha_{h_1}(j) + \alpha_{h_1}(N_{h_1}) - \alpha_{h_1}(R_{h_1,d}), \quad j \le R_{h_1,d},$$


$$\beta_{h_1}(j) = \alpha_{h_1}(j) - \alpha_{h_1}(R_{h_1,d}), \quad j > R_{h_1,d}.$$


It is sufficient to find the values of $\alpha_{h_1}(R_{h_1,d})$ and 
$\alpha_{h_1}(N_{h_1})$, after which a simple sequential calculation 
makes it possible to deduce $\beta$ from $\alpha$. 

The Jacobian of the transformation is equal to 1 and 
consequently the numbers conserve their uniform 
distribution. Moreover, the joint distribution $p(s_1, s_2)$ is the 
same as if there had been no rotation. The demonstration is 
provided in Cotton and Hesse (1992, page 55). We 
therefore have the maximum coverage of SSRS. 

If there are units with $f_{h_2} < n_{h_1}'/N_{h_1}$ in the stratum and 
we apply the transformation, the units whose rank falls 
approximately between $N_{h_1} f_{h_2}$ and $n_{h_1}'$ are not reselected 
during the resampling but will be reintroduced during a 
future rotation. It is therefore preferable to use, for these 
units, a transformation which places them just before 1 among 
the new numbers. We must proceed by subsets according to 
the value of $f_{h_2}$. However, that tends to decrease the 
coverage. 
APPENDIX 
Probabilities of Inclusion in the 
Kish and Scott Method (1971) 


Let us consider an example where the first-order 
probability of inclusion is not strictly controlled. 

The population is divided into three parts A, B and C of 
equal size $N$. The first sampling is a SRS of $2a$ units in 
A + B and a SRS of $a$ units in C. During the second 
sampling, we wish to select $a$ units in A and $2a$ units in 
B + C, while retaining the maximum number of units from 
the first sample and with uniform probability of inclusion 
$a/N$. The Kish and Scott method consists in adding or 
removing by SRS the appropriate number of units 
separately in A and in B + C. In A, the second marginal 
sampling is a SRS and the probability of inclusion is indeed 
uniform. We will show that this is not the case in B + C. 
Let $n_1$ and $n_2$ be the sizes of the two successive samples in 
B. By symmetry, the probability of inclusion during the 
second sampling is uniform in B. It is equal to: 


$$E(n_2)/N = [E(n_1) + E(n_2 - n_1)]/N = a/N + E(n_2 - n_1)/N.$$

If $n_1 = a$, $n_2 - n_1 = 0$; otherwise the expected value of 
$n_2 - n_1$ conditional on $n_1$ differs depending on the sign of 
$a - n_1$: 

If $a - n_1 > 0$, $E[(n_2 - n_1) \mid n_1] = (a - n_1)(N - n_1)/(2N - n_1 - a)$. 
If $a - n_1 < 0$, $E[(n_2 - n_1) \mid n_1] = (a - n_1) n_1/(n_1 + a)$. 

Let $p(n_1)$ denote the probability that the first sample will have 
the size $n_1$ in B. This yields: 


$$E(n_2 - n_1) = \sum_{n_1} p(n_1) E[(n_2 - n_1) \mid n_1].$$


Since the sizes of A and B are equal, $p(n_1) = p(2a - n_1)$, 
therefore: 


$$E(n_2 - n_1) = \sum_{n_1 < a} p(n_1) \{ E[(n_2 - n_1) \mid n_1] + E[(n_2 - n_1) \mid 2a - n_1] \}$$

$$= \sum_{n_1 < a} p(n_1)(a - n_1) [ (N - n_1)/(2N - n_1 - a) - (2a - n_1)/(3a - n_1) ]$$

$$= (2a - N) \sum_{n_1 < a} p(n_1)(a - n_1)^2 / [(2N - n_1 - a)(3a - n_1)]$$

$$= (2a - N) K, \quad K > 0.$$


Except in the case $2a - N = 0$, $E(n_2 - n_1)$ is not nil and 
$E(n_2)/N$ is different from $a/N$. The probability of 
inclusion is therefore not uniform in B + C. 
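
The result of the appendix can be checked numerically with exact arithmetic. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the size of the first sample in B is taken to follow the hypergeometric distribution implied by a SRS of $2a$ units in A + B, two parts of $N$ units each.

```python
from fractions import Fraction
from math import comb


def p_n1(N, a, n1):
    """Probability that a SRS of 2a units in A + B puts n1 units in B."""
    return Fraction(comb(N, n1) * comb(N, 2 * a - n1), comb(2 * N, 2 * a))


def expected_change(N, a):
    """E(n2 - n1) in B, summing the conditional expectations over n1."""
    total = Fraction(0)
    for n1 in range(max(0, 2 * a - N), min(N, 2 * a) + 1):
        if n1 < a:       # units are added by SRS in the complement within B + C
            cond = Fraction((a - n1) * (N - n1), 2 * N - n1 - a)
        elif n1 > a:     # units are removed by SRS from the sample within B + C
            cond = Fraction((a - n1) * n1, n1 + a)
        else:
            cond = Fraction(0)
        total += p_n1(N, a, n1) * cond
    return total


def closed_form(N, a):
    """(2a - N) K, with K the positive sum appearing in the appendix."""
    K = sum(
        (p_n1(N, a, n1) * Fraction((a - n1) ** 2, (2 * N - n1 - a) * (3 * a - n1))
         for n1 in range(max(0, 2 * a - N), a)),
        Fraction(0),
    )
    return (2 * a - N) * K


# Hypothetical sizes: parts of N = 10 units, a = 2 units required per part.
change = expected_change(10, 2)
```

For these sizes the expected change is negative, so the probability of inclusion in B falls below $a/N$ and that in C rises above it by the same amount.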
Empirical Bayes Estimation of Small Area Proportions Based 
on Ordinal Outcome Variables 


PATRICK J. FARRELL’ 


ABSTRACT 


Much research has been conducted into the modelling of ordinal responses. Some authors argue that, when the response 
variable is ordinal, inclusion of ordinality in the model to be estimated should improve model performance. Under the 
condition of ordinality, Campbell and Donner (1989) compared the asymptotic classification error rate of the multinomial 
logistic model to that of the ordinal logistic model of Anderson (1984). They showed that the ordinal logistic model had 
a lower expected asymptotic error rate than the multinomial logistic model. This paper also aims to compare the 
performance of ordinal and multinomial logistic models for ordinal responses. However, rather than focussing on 
classification efficiency, the assessment is made in the context of an application where the objective is to estimate small area 
proportions. More specifically, using multinomial and ordinal logistic models, the empirical Bayes approach proposed 
by Farrell, MacGibbon and Tomberlin (1997a) for estimating small area proportions based on binomial outcome data is 
extended to response variables consisting of more than two outcome categories. The properties of estimators based on these 
two models are compared via a simulation study in which the empirical Bayes methods proposed here are applied to data 
from the 1950 United States Census with the objective of predicting, for a small area, the proportion of individuals who 
belong to the various categories of an ordinal response variable representing income level. 


KEY WORDS: Bootstrap; Complex survey design; Logistic regression; Random effects models; Small area summary 
statistics; Taylor series. 


1. INTRODUCTION 


Much research has been conducted into the modelling of 
ordinal responses (see Albert and Chib 1993, Anderson 
1984, Crouchley 1995, and McCullagh 1980). Some 
authors argue that, when the response variable is ordinal, 
inclusion of ordinality in the model to be estimated should 
improve model performance. Under the condition of 
ordinality, Campbell and Donner (1989) theoretically 
compared the asymptotic classification error rate of the 
multinomial logistic model to that of the ordinal logistic 
model of Anderson (1984), demonstrating that the ordinal 
model had a lower expected asymptotic error rate. 
However, in a subsequent simulation study, Campbell, 
Donner, and Webster (1991) illustrated that ordinal models 
classify less accurately than multinomial models under a 
variety of circumstances, and concluded that ordinal models 
confer no advantage when the main purpose of an analysis 
is classification. 

This paper also aims to compare the performance of 
ordinal and multinomial logistic models for ordinal 
responses. However, rather than focussing on classification 
efficiency, the assessment is made in the context of an 
application where the objective is to estimate small area 
proportions. 

The estimation of small area parameters is a finite 
population sampling problem which has received 
considerable attention. An excellent review of such research 
appears in Ghosh and Rao (1994). These authors 
demonstrate that as a compromise between synthetic and direct 
survey estimators, estimators based on empirical or 
hierarchical Bayes procedures are not subject to the large 
bias that is sometimes associated with a synthetic estimator 
(see Gonzales 1973), nor are they as variable as a direct 
survey estimator. A similar conclusion was drawn by 
Farrell, MacGibbon, and Tomberlin (1997a) in a study of 
the properties of an empirical Bayes estimator for small area 
proportions based on a binomial outcome variable. 

Despite the numerous studies aimed at predicting small 
area proportions based on binomial response variables (see 
Dempster and Tomberlin 1980, MacGibbon and Tomberlin 
1989, Farrell 1991, Farrell et al. 1997a, Malec, Sedransk, 
and Tompkins 1993, Stroud 1991, and Wong and Mason 
1985), little attention has been given to estimating 
proportions based on response variables with more than two 
outcome categories. This paper extends the empirical 
Bayes approach of Farrell et al. (1997a) to such response 
variables by basing the estimates on multinomial and 
ordinal logistic models. To compare the estimates of small 
area proportions based on an ordinal outcome variable 
using multinomial and ordinal models, the proposed 
empirical Bayes methods are applied to data from the 1950 
United States Census in order to predict, for a given small 
area, the proportion of individuals who belong to the 
various categories of an ordinal response variable 
representing income level. 

For such an estimation problem, there are many issues 
which require attention. They include the selection of 
predictor variables for the model, model diagnostics, the 
sample design, and the properties of the estimators 
employed. For example, among the model diagnostics for 
the multinomial and ordinal models was an assessment of 
model fit which was based on residuals. For a description 
of this diagnostic and others, see Farrell (1991). The 
findings did not appear to indicate a lack of fit for either 
model. In this study, the focus is on investigating the 
properties of empirical Bayes estimators over repeated 
realizations of the sample design using a simulation. For 
many survey practitioners, such properties are of prime 
importance. 

One concern associated with using an empirical Bayes 
estimation approach is that interval estimates do not attain 
the desired level of coverage, since the uncertainty that 
arises from having to estimate the parameters of the prior 
distribution is not accounted for. This study incorporates 
the suggestion of Laird and Louis (1987) to use bootstrap 
techniques for adjusting naive estimates of accuracy. 
Alternatively, Prasad and Rao (1990) have developed a 
procedure which attempts to account for the uncertainty not 
captured by the naive estimates. Although their approach 
was designed for three specific linear models containing 
random effects, Cressie (1992) has made certain conjectures 
as to when the procedure is appropriate. Of importance is 
the constraint that the outcome variable must follow a 
normal distribution. 

The proposed empirical Bayes procedures based on 
multinomial and ordinal logistic models are presented in 
Section 2. The simulation study to compare multinomial 
and ordinal logistic models for ordinal responses is 
described in Section 3, while the conclusions and 
discussion are presented in Section 4. 


2. ESTIMATION PROCEDURES 


Consider a discrete small area characteristic of interest 
with $M$ possible outcomes. The subscripts $m$ and $m^*$ will reference 
these categories, where $m = 1, \ldots, M-1$ and $m^* = 1, \ldots, M$. 
In addition, underlined lower case and capital letters will 
designate vectors, while bold capital letters will represent 
matrices. 

The estimation procedures are illustrated under a two 
stage sample design, where individuals are sampled from 
selected local areas. Thus, local areas are the primary 
sampling units here. Let $p_{im^*}$ be the proportion of 
individuals in the $i$-th local area that belong to category $m^*$ 
of the response variable. Then 

$$p_{im^*} = \sum_{j} y_{ijm^*} / N_i, \qquad (2.1)$$

where $y_{ijm^*}$ is either zero or one, depending upon whether 
the $j$-th individual in local area $i$ belongs to category $m^*$ of 
the characteristic of interest, and $N_i$ is the population size 
of the $i$-th local area. 

The approach employed by Farrell et al. (1997a) to 
estimate small area proportions based on binomial outcome 
variables is extended here to allow for the estimation of 
$p_{im^*}$. The procedure follows the explicitly model-based 
approach proposed by Dempster and Tomberlin (1980). 
Let $\pi_{ijm^*}$ represent the probability that the $j$-th individual 
within the $i$-th local area belongs to category $m^*$ of the 
response variable. Then, according to Royall (1970), $p_{im^*}$ 
in (2.1) is estimated by 

$$\hat{p}_{im^*} = \left( \sum_{j \in S} y_{ijm^*} + \sum_{j \in S'} \hat{\pi}_{ijm^*} \right) / N_i, \qquad (2.2)$$

where $S$ is the set of $n_i$ sampled individuals from local area 
$i$, and $S'$ is the set of individuals in local area $i$ not included 
in the sample. Values for the $\hat{\pi}_{ijm^*}$ are required. To obtain 
these estimates, logistic regression models are used to 
describe the probabilities associated with individuals in the 
population. 

Under a multinomial logistic model, the $\pi_{ijm^*}$ are 
described as follows: 

$$\log\left( \pi_{ijm} / \pi_{ijM} \right) = \underline{X}_{ij}^T \underline{\beta}_m + \delta_{im}, \qquad (2.3)$$

$$\underline{\delta}_i \sim \text{i.i.d. Normal}(\underline{0}, \mathbf{D}),$$

where $\underline{\delta}_i^T = (\delta_{i1}, \ldots, \delta_{i(M-1)})$, $i = 1, \ldots, I$, and $\mathbf{D}$ is an unknown 
covariance matrix. In this model, $\underline{X}_{ij}$ is a vector of fixed 
effects predictor variables, the vector $\underline{\beta}_m$ contains the fixed 
effects parameters associated with the $m$-th category of the 
outcome variable of interest, and $\delta_{im}$ is a normally 
distributed random effect associated with the $m$-th category 
of the characteristic of interest in the $i$-th local area. The 
vector $\underline{X}_{ij}$ may include covariates at both the individual and 
aggregate levels. For sample designs of more than two 
stages, an analogous model would contain random effects 
for the sampling units at each stage, excluding the final one. 
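
As an illustration of how (2.2) combines observed indicators with model-based predictions, the following Python sketch computes category probabilities of the baseline-category form in (2.3) and then the predictive estimate for one local area. It is a minimal sketch under stated assumptions: the function names and argument layout are hypothetical, the last category is assumed to be the reference category, and the fixed and random effects are taken as already estimated.

```python
import math

def multinomial_probs(x, betas, deltas):
    """Category probabilities under a baseline-category logit of the form (2.3).

    x      : covariate values for one individual
    betas  : M-1 coefficient lists, one per non-reference category
    deltas : M-1 local area random effects, one per non-reference category
    Returns M probabilities; the last category is the (assumed) reference.
    """
    etas = [sum(b * v for b, v in zip(beta, x)) + d
            for beta, d in zip(betas, deltas)]
    # Shift by the largest linear predictor (0 for the reference) for stability.
    shift = max(etas + [0.0])
    expd = [math.exp(e - shift) for e in etas] + [math.exp(-shift)]
    total = sum(expd)
    return [e / total for e in expd]

def predictive_proportions(y_sampled, x_nonsampled, betas, deltas, N_i):
    """Estimator of the form (2.2): indicators for S plus fitted probabilities for S'."""
    M = len(betas) + 1
    totals = [0.0] * M
    for y in y_sampled:                 # y is a 0/1 indicator vector of length M
        for m in range(M):
            totals[m] += y[m]
    for x in x_nonsampled:              # covariates of individuals not in the sample
        for m, p in enumerate(multinomial_probs(x, betas, deltas)):
            totals[m] += p
    return [t / N_i for t in totals]
```

With all coefficients and random effects set to zero, each nonsampled individual contributes a probability of 1/3 to each of three categories, so the estimate reduces to the sampled counts plus an equal share of the nonsampled individuals, divided by $N_i$.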

Note that the model in (2.3), unlike a similar model 
proposed by Malec et al. (1993), does not contain 
interaction terms between the local area effects and the 
fixed effects predictor variables. However, terms to 
acknowledge such interaction could be included if they 
were deemed necessary. 

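To obtain Bayes estimates of the model parameters, 
values are assumed for the unknown parameters of the 
random effects distribution. Let $\underline{y}_{ij}^T = (y_{ij1}, \ldots, y_{ijM})$ be a 
vector for the $ij$-th sampled individual where the component 
associated with the category of the outcome variable to 
which the individual belongs has a value of one. The 
remaining entries are zero. If $\mathbf{Y}$ is a matrix with rows $\underline{y}_{ij}^T$, 
then the data are distributed as: 

$$f(\mathbf{Y} \mid \underline{\beta}, \underline{\delta}) \propto \prod_{ij} \pi_{ij1}^{y_{ij1}} \pi_{ij2}^{y_{ij2}} \cdots \pi_{ijM}^{y_{ijM}},$$

where $\underline{\beta}^T = (\underline{\beta}_1^T, \ldots, \underline{\beta}_{M-1}^T)$, and $\underline{\delta}^T = (\underline{\delta}_1^T, \ldots, \underline{\delta}_I^T)$. If a flat 
distribution is specified for the fixed effects, the distribution 
of the parameters is $f(\underline{\beta}, \underline{\delta} \mid \mathbf{D}_*) \propto \exp(-\tfrac{1}{2} \underline{\delta}^T \mathbf{D}_*^{-1} \underline{\delta})$, 
where $\mathbf{D}_* = \text{diag}(\mathbf{D}, \mathbf{D}, \ldots, \mathbf{D})$. The joint distribution of the 
data and the parameters is determined using $f(\mathbf{Y} \mid \underline{\beta}, \underline{\delta})$ and 
$f(\underline{\beta}, \underline{\delta} \mid \mathbf{D}_*)$, and subsequently employed to obtain the 
posterior distribution of the parameters. Unfortunately, a 
closed form for this posterior distribution cannot be derived 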
due to the intractable integration required to obtain the 
marginal distribution of Y. A possible approach could be a 
stochastic integration method such as Gibbs sampling (see 
Zeger and Karim 1991). Ripley and Kirkland (1990) 
indicate that the drawbacks of such an approach include the 
intensive computations and questions about when the 
sampling process has achieved equilibrium. Since 
computing time is of particular concern for the simulation 
discussed in Section 3, this approach will not be pursued 
here. Alternatively, Breslow and Clayton (1993) state that 
there is still room for simple, approximate methods. Many 
authors have found that a multivariate normal approxi- 
mation of the posterior works very well in practice (see 
Farrell et al. 1997a, Laird 1978, Tomberlin 1988, and 
Wong and Mason 1985). Breslow and Lin (1995) warn, 
however, that such an approach might yield inconsistent 
estimates for the fixed effects parameters. Thus, if $\hat{p}_{im^*}$ is 
to be based on fixed effects estimates obtained in this 
manner, the same might apply to the consistency of $\hat{p}_{im^*}$ as 
an estimator for $p_{im^*}$. 

Following Farrell et al. (1997a), the posterior distri- 
bution of the parameters is approximated as a multivariate 
normal distribution having its mean at the mode and 
covariance matrix equal to the inverse of the information 
matrix evaluated at the mode. The information matrix here 
is the negative of the matrix of second derivatives of the log posterior density 
taken with respect to $\underline{\beta}$ and $\underline{\delta}$. When values are specified 
for the unknown parameters of the random effects 
distribution, the resulting mode and covariance matrix 
constitute an initial set of estimates of the model 
parameters. Empirical Bayes estimates are then obtained by 
using the EM algorithm described by Dempster, Laird, and 
Rubin (1977) to determine estimates for the parameters of 
the random effects distribution. The algorithm converges 
quickly, taking only a few minutes in real time. For details 
on how the empirical Bayes estimates are obtained for a 
model based on a two stage sample design and a binomial 
response variable, see MacGibbon and Tomberlin (1989). 

The empirical Bayes estimates of the model parameters 
are used in (2.2) to determine $\hat{p}_{im^*}$. In developing an 
expression for the uncertainty of $\hat{p}_{im^*}$, $N_i$ is assumed to be 
known. Since the approach being used is model-based and 
predictive in nature, the uncertainty in $\hat{p}_{im^*}$ arises solely 
from the $\sum_{j \in S'} \hat{\pi}_{ijm^*}$ term; the $\sum_{j \in S} y_{ijm^*}$ term has zero variance. 
Thus, the mean square error of $\hat{p}_{im^*}$ as a predictor for $p_{im^*}$ 
can be estimated as 

$$\widehat{\text{MSE}}(\hat{p}_{im^*}) = \text{Var}\left( \frac{\sum_{j \in S'} \hat{\pi}_{ijm^*}}{N_i} \right) + \frac{\sum_{j \in S'} \hat{\pi}_{ijm^*} (1 - \hat{\pi}_{ijm^*})}{N_i^2}. \qquad (2.4)$$

For sampled local areas, where $n_i$ is greater than zero, the 
first term of (2.4) is of order $1/n_i$, while the second term is 
of order $1/N_i$. In this study, the approximation of the mean 
square error of $\hat{p}_{im^*}$ is based on the first term only, which 
yields a useful approximation provided that $N_i$ is large 
compared to $n_i$. For nonsampled local areas, the first term 
in (2.4) is of order 1; therefore it always dominates the 
second term. 

To estimate the uncertainty of $\hat{p}_{im^*}$, which is expressed 
as a non-linear function of the estimators of the fixed and 
random effects, the expression for $\hat{p}_{im^*}$ is linearized by 
taking a first order multivariate Taylor series expansion 
about the realized values of the fixed and random effects. 
The variance of the resulting expression, call it $\text{Var}(\hat{p}_{im^*})$, 
is taken as an estimate of the uncertainty of $\hat{p}_{im^*}$. Details of 
the Taylor series expansion are given in Farrell et al. 
(1997a) for a binomial outcome variable. 

When population micro-data for auxiliary variables are 
not available, $\hat{p}_{im^*}$ in (2.2) cannot be determined. For 
non-linear models such as (2.3), prediction is not 
straightforward in this situation. However, an alternative estimator 
to $\hat{p}_{im^*}$, say $\tilde{p}_{im^*}$, which requires only local area summary 
statistics (a mean vector and finite population covariance 
matrix) for both continuous and categorical variables can be 
obtained by extending the approach proposed by Farrell, 
MacGibbon, and Tomberlin (1997b) for achieving this 
objective when estimating binomial small area parameters. 
The same Taylor series expansion that was used to estimate 
the accuracy of $\hat{p}_{im^*}$ can be employed to obtain a measure 
of the uncertainty for $\tilde{p}_{im^*}$, $\text{Var}(\tilde{p}_{im^*})$. 

The approach described in this section can also be used 
to develop point and interval estimates for small area 
proportions based on $\hat{p}_{im^*}$ and $\tilde{p}_{im^*}$ when an ordinal model 
is used. In this study, a fixed and random effects model is 
proposed for the $\pi_{ijm^*}$ which is based on the ordinal model 
proposed by McCullagh (1980): 

$$\log\left( \frac{\pi_{ij1} + \cdots + \pi_{ijm}}{\pi_{ij(m+1)} + \cdots + \pi_{ijM}} \right) = \beta_{0m} + \underline{X}_{ij}^T \underline{\beta} + \delta_{im}, \qquad (2.5)$$

$$\underline{\delta}_i \sim \text{i.i.d. Normal}(\underline{0}, \mathbf{D}).$$

The vector $\underline{X}_{ij}$ contains the values of the fixed effects 
predictor variables for the $ij$-th individual, while $\underline{\beta}$ 
represents a vector of fixed effects parameters. Associated 
with the $m$-th category of the response variable is a constant 
term, $\beta_{0m}$. The random effects are again assumed to be 
normally distributed. Note that an important feature of the 
model in (2.5) is that the restriction $\beta_{0(m+1)} - \beta_{0m} \geq \delta_{im} - \delta_{i(m+1)}$ 
must hold in order for $\pi_{ij(m+1)} \geq 0$. A discussion 
concerning this constraint is given in Section 3. 

The approach used to approximate the uncertainty in 
$\hat{p}_{im^*}$ and $\tilde{p}_{im^*}$ when $\hat{\pi}_{ijm^*}$ is based on either (2.3) or (2.5) 
can be described as naive, since $\text{Var}(\hat{p}_{im^*})$ and $\text{Var}(\tilde{p}_{im^*})$ 
do not account for the uncertainty which results from 
estimating the parameters of the random effects 
distribution. Thus, interval estimates for $p_{im^*}$ that are based on 
$\text{Var}(\hat{p}_{im^*})$ and $\text{Var}(\tilde{p}_{im^*})$ are typically too short. Many 
approaches have been proposed for addressing this issue 
(see Carlin and Gelfand 1990, and Laird and Louis 1987). 
In this study, the Type III bootstrap proposed by Laird and 
Louis (1987) is used to adjust naively-estimated measures 
of uncertainty. The procedure is described in Farrell et al. 
(1997a) for a binomial outcome variable. It can be 
extended to (2.3) and (2.5), and is applicable regardless of 
whether estimation is based on $\hat{p}_{im^*}$ or $\tilde{p}_{im^*}$. 

The procedure requires that a number of bootstrap 
samples, $N_B$, be generated from a given set of data. 
Suppose that small area estimation is to be based on $\hat{p}_{im^*}$. 
For the $b$-th bootstrap sample, an estimate $\hat{p}_{bim^*}$ for $p_{im^*}$ 
based on (2.3) or (2.5), along with a naive estimate of the 
variability of $\hat{p}_{bim^*}$, $\text{Var}(\hat{p}_{bim^*})$, are obtained. The 
quantities $\hat{p}_{bim^*}$ and $\text{Var}(\hat{p}_{bim^*})$ are determined for each of $N_B$ 
bootstrap samples, and used to calculate a 
bootstrap-adjusted estimate of the variability associated with $\hat{p}_{im^*}$: 


Note that even though individuals are not selected by 
simple random sampling without replacement in this study, 
survey weights have not been attached to the records. 
However, in practice, the weights attached to a record will 
vary due to features of the survey design, such as 
differential nonresponse and clustering. In this study, the 
models account for the effects of these features. Further 
research is necessary to determine what impact the 
incorporation of survey weights into the models would have 
on the bootstrapping procedure. 


3. A DATA EXAMPLE 


A comparison of the estimates for small area proportions 
based on multinomial and ordinal logistic models was 
carried out using a simulation study where the response 
variable was ordinal. The data set is based on a 1% sample 
of the 1950 United States Census (United States Bureau of 
the Census 1984). Data based on the 1950 Census is used 
since it constitutes a public use microdata sample, and none 
of the more recent census data is available in this form. 
Thus, the results below for the multinomial and ordinal 
models are obtained by using predictor variable data for 
each individual within a local area. For a discussion of the 
difficulties encountered in obtaining microdata, see 
Bethlehem, Keller, and Pannekoek (1990). 

The application considered is the estimation of the 
proportion of individuals in a given local area associated 
with each of the three categories of an ordinal outcome 
variable representing total personal income, where a local 
area is typically specified to be a state. This variable 
encompasses all sources of income, including wages and 
salaries, business income, and net income from other 
sources. An individual is regarded as having a low (less 
than $2,500), medium ($2,500 to under $10,000) or high 
($10,000 and over) level of total personal income during 
1949. Thus, m = 1 for low income (Category 1), m =2 for 
medium income (Category 2), and m = 3 for high income 
(Category 3). The multinomial and ordinal models were 
each used to obtain point and interval estimates for 42 local 
areas. Twenty of these areas were sampled, the others were 
not. Note that individuals with no income were included in 
Category 1. An alternative approach would have been a 
two stage model; a first stage logistic model for the 
probability of non-zero income, and a second stage 
multinomial or ordinal model for income category 
conditional on non-zero income. 

In practice, historical data are often available for survey 
planning purposes. For example, variable selection for 
purposes of model predictions could be based on previous 
census data. To emulate this situation, a random sample of 
size 2,000 was selected from the 1% sample. Variables for 
model prediction were determined by applying a stepwise 
logistic regression procedure. The variables selected were 
age, gender, and race. With regard to race, individuals 
were categorized as white, negro, or other. 

Thus, the multinomial and ordinal models used in this 
study included four individual level predictor variables for 
age, gender, and race (two indicator variables were required 
to code the various races). However, they also contained 
four local area variables representing average age, the 
proportion of males, the proportion of whites, and the 
proportion of negroes. Regardless of which model is 
considered, these local area variables are necessary since, 
when they are excluded, a relationship is noted between the 
expected value of $\hat{p}_{im^*}$ and its bias: as the expected 
value increases, the bias increases from large negative to 
large positive values. The inclusion of domain level 
covariates removes this correlation. Therefore, since local 
area variables are also included in the models, the 
multinomial model contains eighteen fixed effects para- 
meters (two for each of the individual level and local area 
predictor variables, and two constant terms) and forty 
random effects (two for each of the twenty sampled local 
areas), while the ordinal model contains ten fixed effects 
parameters (one for each of the individual level and local 
area predictor variables, and two constant terms) and forty 
random effects (two for each of the twenty sampled local 
areas). For a detailed study comparing logistic regression 
models for estimating small area proportions with and 
without domain level covariates which uses binomial 
outcome data, see Farrell et al., (1997a). 

The data for estimating the proportions of individuals in 
each local area belonging to the various income level 
categories were obtained from the 1% sample using a self- 
weighting two stage sample design. In the first stage, 20 
out of 42 local areas were selected, without replacement, 
using probabilities proportional to size (PPS). More 
specifically, the approach used to select these local areas 
was randomized systematic selection of primary sampling 
units with PPS (see Kish 1965, p. 230). Then, at the second 
stage, 50 individuals were randomly selected from each 
chosen local area. A total of 500 samples were drawn using 
this two stage design; however, resampling was not 
performed at the local area selection stage. Thus, the same 
20 local areas were sampled in each of the 500 replicates. 
For these 20 sampled local areas, the average local area 
proportions for Categories 1, 2, and 3 of income level are 
0.7142, 0.2260, and 0.0598. 
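
The two stage design described above can be sketched in Python as follows. This is an illustration only, under one common reading of randomized systematic PPS selection (shuffle the list of units, cumulate the sizes, and step through with a random start); the function names and the area sizes in any usage are hypothetical, and the sketch is not claimed to reproduce the exact procedure in Kish (1965).

```python
import random

def pps_systematic(sizes, n, rng):
    """Select n units with probability proportional to size.

    The list order is randomized first, then a systematic pass with a
    random start is made through the cumulated sizes. A unit whose size
    exceeds the sampling interval could be selected more than once.
    """
    order = list(range(len(sizes)))
    rng.shuffle(order)
    step = sum(sizes) / n
    start = rng.random() * step          # random start in [0, step)
    chosen, cum, k = [], 0.0, 0
    for unit in order:
        cum += sizes[unit]
        # Take every selection point that falls inside this unit's range.
        while k < n and start + k * step < cum:
            chosen.append(unit)
            k += 1
    return chosen

def two_stage_sample(area_sizes, n_areas, n_within, rng):
    """First stage: PPS systematic selection of areas; second stage:
    simple random sampling without replacement within each chosen area."""
    areas = pps_systematic(area_sizes, n_areas, rng)
    return {a: rng.sample(range(area_sizes[a]), n_within) for a in areas}
```

Selecting areas with probability proportional to size and then taking the same number of individuals in each chosen area gives every individual the same overall selection probability, which is what makes a design of this kind self-weighting.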

Note that for the ordinal model, the constraint $\beta_{02} - \beta_{01} \geq \delta_{i1} - \delta_{i2}$ 
must hold in order for $\pi_{ij2} \geq 0$. A check of 
this constraint for each of the 500 samples using the 
estimates for the constant terms and the random effects 
indicated that it held at all times. In fact, it was discovered 
that in each of the 500 samples taken, the difference in the 
estimates for the constant terms was always positive, at 
least two orders of magnitude larger than the majority of the 
absolute differences of the random effects estimates, and 
always at least one order of magnitude larger. Thus, the constant 
terms in the model dominate over the random effects. 
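
The role of this constraint can be illustrated with a short Python sketch that converts cumulative logits into category probabilities. It is a minimal sketch, assuming that the cumulative logit for category $m$ is the sum of a constant term, a linear predictor, and a random effect; the function names and all numerical values used with it are hypothetical.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probs(xb, beta0, delta):
    """Category probabilities from cumulative logits.

    xb    : scalar linear predictor for the individual
    beta0 : M-1 constant terms, one per cumulative logit
    delta : M-1 local area random effects
    The probability of category m+1 is the difference between successive
    cumulative probabilities, so it is negative when the constraint fails.
    """
    cum = [expit(b0 + xb + d) for b0, d in zip(beta0, delta)] + [1.0]
    return [cum[0]] + [cum[m] - cum[m - 1] for m in range(1, len(cum))]

def constraint_holds(beta0, delta):
    """Check beta0[m+1] - beta0[m] >= delta[m] - delta[m+1] for every m."""
    return all(beta0[m + 1] - beta0[m] >= delta[m] - delta[m + 1]
               for m in range(len(beta0) - 1))
```

When the difference in constant terms is large relative to the differences in random effects, as reported for this study, the check passes and all three probabilities are non-negative; when the constant terms are close together, a modest difference in random effects is enough to produce a negative middle-category probability.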

To compare the properties of estimators for small area 
proportions over repeated realizations of the sample design, 
for each of the 500 samples selected the quantities 
$\hat{p}_{im^*}$, $\text{Var}(\hat{p}_{im^*})$, and $\text{Var}^*(\hat{p}_{im^*})$ associated with each 
income level category were obtained for each local area, 
sampled or not, using both the multinomial and ordinal 
models. For each model, the estimates for $\text{Var}(\hat{p}_{im^*})$ and 
$\text{Var}^*(\hat{p}_{im^*})$ were used to construct naive and 
bootstrap-adjusted empirical Bayes symmetric 95% confidence 
intervals, respectively. Estimates for $\text{Var}^*(\hat{p}_{im^*})$ were 
obtained by using the bootstrap procedure to generate 100 
bootstrap samples from each of the 500 simulation samples. 
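
A bootstrap adjustment of this kind can be sketched in Python as follows. The sketch assumes the form commonly associated with Laird and Louis (1987), in which the adjusted variance is the average of the naive variances over the bootstrap samples plus the sample variance of the bootstrap point estimates; the divisor used for the second term, the function names, and the normal quantile 1.96 are assumptions made for illustration, and the exact expression used in this study may differ.

```python
import math

def bootstrap_adjusted_variance(boot_estimates, boot_naive_vars):
    """Average naive variance plus the variance of the bootstrap estimates.

    boot_estimates  : point estimates, one per bootstrap sample
    boot_naive_vars : naive variance estimates, one per bootstrap sample
    """
    n_b = len(boot_estimates)
    mean_est = sum(boot_estimates) / n_b
    # Variability of the point estimate across bootstrap samples.
    between = sum((p - mean_est) ** 2 for p in boot_estimates) / (n_b - 1)
    # Average of the naive (within-sample) variance estimates.
    within = sum(boot_naive_vars) / n_b
    return within + between

def symmetric_interval(estimate, variance, z=1.96):
    """Symmetric normal-theory interval around a point estimate."""
    half = z * math.sqrt(variance)
    return estimate - half, estimate + half
```

Because the second term is non-negative, an interval built from the adjusted variance is never shorter than one built from the average naive variance alone, which is the direction of correction needed when naive intervals are too short.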

Note that for the ordinal model, the constraint $\beta_{02} - \beta_{01} \geq \delta_{i1} - \delta_{i2}$ 
must also hold in the bootstrap procedure for 
random effects generated from an estimated distribution; 
otherwise negative estimates for some of the probabilities $\pi_{ijm^*}$ 
will result when creating bootstrap samples. Over the 
course of the simulation for the application considered here, 
no negative probabilities were encountered when 
bootstrapping. One approach for assessing the likelihood of 
negative probabilities during the bootstrap procedure is to 
consider the ratio of the difference $\beta_{02} - \beta_{01}$ to the 
estimated prior standard deviation of the difference 
$\delta_{i1} - \delta_{i2}$. This ratio was determined for each sampled local 
area in each of the 500 simulation samples taken. The 
average of this entire set of ratios was 6.8, and none were 
found to be less than 5.8. Thus, the difference $\beta_{02} - \beta_{01}$ 
was determined to always be at least 5.8 times the estimated 
standard deviation of the difference $\delta_{i1} - \delta_{i2}$. Based on the 
empirical rule, a rule of thumb would be to conclude that 
when the ratio described above is at least three, it is highly 
unlikely that negative probabilities will arise when 
bootstrapping. 
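
The ratio and the rule of thumb can be made concrete with a small Python sketch. It assumes that the two random effects for a local area are jointly normal with mean zero, so that their difference has variance $D_{11} + D_{22} - 2D_{12}$; the function names and the covariance entries passed to them are hypothetical.

```python
import math

def constraint_ratio(beta01, beta02, d11, d22, d12):
    """Ratio of the difference in constant terms to the standard deviation
    of the difference of the two random effects."""
    sd_diff = math.sqrt(d11 + d22 - 2.0 * d12)
    return (beta02 - beta01) / sd_diff

def prob_violation(ratio):
    """Upper tail probability of a standard normal beyond `ratio`, i.e. the
    chance that a generated difference of random effects exceeds the
    difference in constant terms."""
    return 0.5 * math.erfc(ratio / math.sqrt(2.0))
```

Under these assumptions a ratio of three corresponds to a violation probability of roughly 0.00135 per generated pair of random effects, and a ratio of 5.8 to a probability several orders of magnitude smaller.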

Table 1 presents average summary statistics over the 500 
simulation samples obtained for the multinomial and 
ordinal models across all sampled local areas for each of 
three income level categories. A study of the stability of 
these statistics was conducted by investigating how they 
changed as additional samples were taken. Only slight 
changes were observed once 150 samples had been reached. 
Table 1 includes the summary statistics obtained for the 
first 200 samples in brackets for comparative purposes. 

For each income category, two summary statistics shown 
in Table 1 were evaluated to compare the design bias of 
$\hat{p}_{im^*}$ for the multinomial and ordinal models: the average 
bias of $\hat{p}_{im^*}$, and the average absolute bias of $\hat{p}_{im^*}$. The 
average bias is simply the mean over all sampled local areas 
of the differences obtained when the actual proportion, 
$p_{im^*}$, for the $i$-th local area is subtracted from the average 
point estimate for the area over the 500 simulation samples. 
The average absolute bias is defined similarly, except that 
the absolute value of each difference is used. Generally 
speaking, the results obtained for these two summary 
statistics were slightly better for the ordinal model, 
regardless of the income category considered. However, 
the multinomial model did result in a somewhat smaller 
average bias for $\hat{p}_{im^*}$ for the low income category. 

For each sampled local area, empirical root mean square 
errors (RMSE’s) were computed over the 500 simulation 
samples under each model for the three income categories. 
For each model and income level combination, the 
appropriate empirical RMSE’s were averaged over all 
sampled local areas, resulting in the average empirical 
RMSE’s presented in Table 1. Once again, the perfor- 
mance of the ordinal model is slightly better for all three 
income level categories. 
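
The three summary statistics just described can be computed as in the following Python sketch. The data layout (one list of replicate estimates per local area) and the function name are hypothetical, and the empirical RMSE is assumed to be the square root of the mean squared deviation of the replicate estimates from the actual proportion.

```python
import math

def area_summaries(estimates, truth):
    """Average bias, average absolute bias, and average empirical RMSE.

    estimates[i][r] : point estimate for local area i in simulation sample r
    truth[i]        : actual proportion for local area i
    """
    biases, rmses = [], []
    for est_i, p_i in zip(estimates, truth):
        mean_i = sum(est_i) / len(est_i)
        biases.append(mean_i - p_i)      # average estimate minus actual value
        mse_i = sum((e - p_i) ** 2 for e in est_i) / len(est_i)
        rmses.append(math.sqrt(mse_i))
    n = len(truth)
    return {
        "average_bias": sum(biases) / n,
        "average_absolute_bias": sum(abs(b) for b in biases) / n,
        "average_empirical_rmse": sum(rmses) / n,
    }
```

Positive and negative area-level biases can cancel in the average bias, which is why the average absolute bias is reported alongside it.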

To study the reduction in empirical RMSE when a 
model-based approach to estimation is used instead of a 
classical design unbiased method, average empirical 
RMSE’s analogous to those in Table 1 based on the 500 
samples were computed using the observed local area 
sample proportions in place of $\hat{p}_{im^*}$. The average empirical 
RMSE's obtained were substantially larger (0.0617, 0.0564, 
and 0.0311 for the low, medium, and high income level 
categories) than those based on $\hat{p}_{im^*}$ under either model. 

Table 1 also includes summary statistics over all sampled 
local areas which relate naive and bootstrap measures of 
variability in $\hat{p}_{im^*}$ to average empirical RMSE. For each 
income level category, the average relative bias and the 
average absolute relative bias of the square root of 
$\text{Var}(\hat{p}_{im^*})$ as an estimate of empirical RMSE are shown in 
Table 1 for the multinomial and ordinal models. The 
average relative bias is simply the mean over all sampled 
local areas of the values obtained when the difference 
resulting from the subtraction of the empirical RMSE for 
the $i$-th local area from the average of the square root of 
$\text{Var}(\hat{p}_{im^*})$ for the area over the 500 simulation samples is 
divided by the empirical RMSE. The average absolute relative bias 
is defined similarly, except that the absolute value of each 
difference is used. The table also presents similar averages 
for the bootstrap-adjusted measures of variability, 
$\text{Var}^*(\hat{p}_{im^*})$. For both the multinomial and ordinal logistic 
models, the average relative bias and average absolute 
relative bias of the bootstrap-adjusted estimates of 
variability are substantially smaller in magnitude than their 
naive counterparts for all three income level categories. In 
addition, these bootstrap-adjusted average summary 
statistics are all very small, which indicates that the 
bootstrap-adjusted estimates of variability are capable of 
incorporating most of the uncertainty that arises from 
having to estimate the distribution of the random effects. 

Table 1 
Average Summary Statistics based on 500 Simulation Samples for the Multinomial and Ordinal Logistic Models 
across all Sampled Local Areas for each Income Level Category. 
The average summary statistics obtained over the first 200 simulation samples are included in brackets for comparative purposes 

    Low Income Level          Medium Income Level       High Income Level
    Multinomial   Ordinal     Multinomial   Ordinal     Multinomial   Ordinal

Bias of $\hat{p}_{im^*}$
    -0.0004       -0.0005     -0.0007       -0.0004     0.0011        0.0009
    (-0.0004)     (-0.0006)   (-0.0006)     (-0.0003)   (0.0010)      (0.0009)

Absolute Bias of $\hat{p}_{im^*}$
    0.0076        0.0051      0.0089        0.0048      0.0108        0.0074
    (0.0078)      (0.0055)    (0.0085)      (0.0046)    (0.0106)      (0.0073)

Empirical RMSE
    0.0479        0.0467      0.0417        0.0401      0.0236        0.0231
    (0.0483)      (0.0469)    (0.0414)      (0.0402)    (0.0233)      (0.0229)

Relative Bias of $\sqrt{\text{Var}(\hat{p}_{im^*})}$
    -0.1192       -0.1125     -0.1273       -0.1180     -0.1524       -0.1376
    (-0.1197)     (-0.1128)   (-0.1276)     (-0.1186)   (-0.1521)     (-0.1372)

Absolute Relative Bias of $\sqrt{\text{Var}(\hat{p}_{im^*})}$
    0.1192        0.1125      0.1273        0.1180      0.1524        0.1376
    (0.1197)      (0.1128)    (0.1276)      (0.1186)    (0.1521)      (0.1372)

Relative Bias of $\sqrt{\text{Var}^*(\hat{p}_{im^*})}$
    -0.0275       -0.0173     -0.0309       -0.0204     -0.0391       -0.0273
    (-0.0272)     (-0.0175)   (-0.0314)     (-0.0207)   (-0.0393)     (-0.0269)

Absolute Relative Bias of $\sqrt{\text{Var}^*(\hat{p}_{im^*})}$
    0.0294        0.0227      0.0349        0.0263      0.0450        0.0353
    (0.0290)      (0.0228)    (0.0343)      (0.0265)    (0.0446)      (0.0347)

Naive Coverage Rate
    91.35         91.91       91.19         91.78       90.67         91.26
    (91.325)      (91.875)    (91.225)      (91.750)    (90.650)      (91.300)

Absolute Deviation of Naive Coverage from the 95% Nominal Rate
    3.65          3.09        3.81          3.22        4.33          3.74
    (3.675)       (3.125)     (3.775)       (3.250)     (4.350)       (3.700)

Adjusted Coverage Rate
    94.44         94.75       94.37         94.68       93.91         94.40
    (94.400)      (94.775)    (94.350)      (94.650)    (93.925)      (94.375)

Absolute Deviation of Adjusted Coverage from the 95% Nominal Rate
    1.58          1.43        [illegible]   1.50        1.91          1.62
    (1.600)       (1.425)     (1.725)       (1.525)     (1.900)       (1.650)

For each sampled local area, naive and bootstrap-adjusted 
coverage rates based on 95% interval estimates 
were computed over the 500 samples under each model for 
the three income level categories. Over all income level 
and model combinations, the bootstrap-adjusted coverage 
rates for individual local areas ranged from 92.2% to 
97.6%. Since an approximate bound for the Monte Carlo 
error is $3\sqrt{(0.95)(0.05)/500}$, or 0.029, all 
bootstrap-adjusted coverage rates are within 3 standard errors of 95%. 
For each model and income level combination, the 
appropriate coverage rates were averaged over all sampled 
local areas, resulting in the average naive and 
bootstrap-adjusted coverage rates in Table 1. A number of 
observations can be made which hold for each income level 
category. For both multinomial and ordinal models, the 
average coverage rates for the bootstrap-adjusted intervals 
are much closer to the 95% nominal rate than those 
associated with the naive intervals. However, both the 
average naive and bootstrap-adjusted coverage rates for the 
ordinal model are slightly better than their counterparts for the 
multinomial model. This is also the case for the average 
absolute deviation of both the naive and bootstrap-adjusted 
coverage rates from the 95% nominal rate. The average 
absolute deviation of the naive coverage rates from the 95% 
nominal rate is simply the mean over all sampled local areas 
of the absolute values of the differences obtained when the 
95% nominal rate is subtracted from the naive coverage 
rates for the sampled local areas over the 500 simulation 
samples. The average absolute deviation of the 
bootstrap-adjusted coverage rates from the 95% nominal rate is 
defined analogously. 
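
The coverage summaries and the Monte Carlo bound quoted above can be reproduced with a few lines of Python. The function names and the data layout are hypothetical; the bound is the stated three standard errors of a binomial proportion with 500 replicates at a nominal rate of 0.95.

```python
import math

def coverage_rate(intervals, truth):
    """Percentage of intervals (lower, upper) that contain the actual value."""
    hits = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return 100.0 * hits / len(intervals)

def monte_carlo_bound(nominal=0.95, replicates=500, k=3.0):
    """k standard errors of an estimated coverage proportion."""
    return k * math.sqrt(nominal * (1.0 - nominal) / replicates)

def average_absolute_deviation(rates, nominal=95.0):
    """Mean absolute difference between area coverage rates and the nominal rate."""
    return sum(abs(r - nominal) for r in rates) / len(rates)
```

Evaluating `monte_carlo_bound()` gives approximately 0.029, so coverage rates between about 92.1% and 97.9% are within three standard errors of 95%, which includes the reported range of 92.2% to 97.6%.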

Twenty-two local areas were not sampled. Estimates for 
the proportion of individuals associated with each income 
level category were also obtained for these areas using the 
multinomial and ordinal models. The findings were similar 
to those for sampled local areas. However, the performance 
of the models deteriorated somewhat, since nonsampled 
local areas constitute a holdout sample. For a detailed 
evaluation of results associated with nonsampled local 
areas, see Farrell et al. (1997a). 

A comparison of the estimates for the three income level 
categories based on micro-data, $\hat{p}_{im^*}$, with those based on 
local area summary statistics, $\tilde{p}_{im^*}$, was also made for each 
model. For both models, the results obtained for $\tilde{p}_{im^*}$ were 
gratifyingly close to those obtained using $\hat{p}_{im^*}$, although 
those obtained for $\hat{p}_{im^*}$ were slightly better. Similar 
findings were obtained by Farrell et al. (1997b) in a 
detailed comparison of $\hat{p}_{im^*}$ and $\tilde{p}_{im^*}$ for a binomial outcome 
variable. 


4. CONCLUSION 


Using multinomial and ordinal logistic models, the 
empirical Bayes approach proposed by Farrell et al. 
(1997a) for estimating small area proportions based on 
binomial outcome data has been extended to accommodate 
outcome variables with more than two categories. It was 
found that the performance of the approach is preserved for 
multicategorical outcome data. 

To compare the estimates of small area proportions 
based on an ordinal outcome variable using multinomial 
and ordinal logistic models, the proposed empirical Bayes 
methods based on these two models were applied to data 
from the 1950 United States Census with the objective of 
predicting, for a small area, the proportion of individuals 
who belong to the various categories of an ordinal response 
variable representing income level. The estimates based on 
the ordinal model were only slightly better in terms of 
design bias, empirical RMSE, and coverage rates. In 
addition, an important feature of the ordinal logistic model 
is that the constraint $\beta_{0(m+1)} - \beta_{0m} \geq \delta_{im} - \delta_{i(m+1)}$ must hold 
in order for $\pi_{ij(m+1)} \geq 0$. Since the results for the 
multinomial and ordinal models in the simulation were very 
similar, a multinomial model could be used for estimating 
small area proportions based on ordinal outcome variables 
when there is concern that fitting an ordinal model may 
result in negative estimates for some of these probabilities. 


ACKNOWLEDGEMENTS 


This research was supported by NSERC of Canada. The 
author is grateful to the associate editor and the referees for 
their valuable comments and suggestions. 


REFERENCES 


ALBERT, J.H., and CHIB, S. (1993). Bayesian analysis of binary and 
polytomous response data. Journal of the American Statistical 
Association, 88, 669-679. 


ANDERSON, J.A. (1984). Regression and ordered categorical 
variables. Journal of the Royal Statistical Society, Series B, 46, 
1-30. 


BETHLEHEM, J.G., KELLER, W.J., and PANNEKOEK, J. (1990). 
Disclosure control of microdata. Journal of the American 
Statistical Association, 85, 38-45. 


BRESLOW, N.E., and CLAYTON, D.G. (1993). Approximate 
inference in generalized linear mixed models. Journal of the 
American Statistical Association, 88, 9-25. 




BRESLOW, N.E., and LIN, X. (1995). Bias correction in generalised 
linear mixed models with a single component of dispersion. 
Biometrika, 82, 81-91. 

CAMPBELL, M.K., and DONNER, A. (1989). Classification 
efficiency of multinomial logistic regression relative to ordinal 
logistic regression. Journal of the American Statistical 
Association, 84, 587-591. 


CAMPBELL, M.K., DONNER, A., and WEBSTER, K.M. (1991). 
Are ordinal models useful for classification? Statistics in 
Medicine, 10, 383-394. 

CARLIN, B.P., and GELFAND, A.E. (1990). Approaches for 
empirical Bayes confidence intervals. Journal of the American 
Statistical Association, 85, 105-114. 


CRESSIE, N. (1992). REML Estimation in empirical Bayes smoothing 
of census undercount. Survey Methodology, 18, 75-94. 


CROUCHLEY, R. (1995). A random-effects model for ordered 
categorical data. Journal of the American Statistical Association, 
90, 489-498. 

DEMPSTER, A.P., LAIRD, N.M., and RUBIN, D.B. (1977). 
Maximum likelihood estimation from incomplete data via the EM 
algorithm. Journal of the Royal Statistical Society, Series B, 39, 
1-38. 

DEMPSTER, A.P., and TOMBERLIN, T.J. (1980). The analysis of 
census undercount from a postenumeration survey. Proceedings 
of the Conference on Census Undercount, Arlington, VA, 88-94. 

FARRELL, P.J. (1991). Empirical Bayes Estimation of Small Area 
Proportions. Ph.D. dissertation, Department of Management 
Science, McGill University, Montreal, Quebec, Canada. 

FARRELL, P.J., MacGIBBON, B., and TOMBERLIN, T.J. (1997a). 
Empirical Bayes estimators of small area proportions in multistage 
designs. Statistica Sinica, 7, 1065-1083. 

FARRELL, P.J., MacGIBBON, B., and TOMBERLIN, T.J. (1997b). 
Empirical Bayes small area estimation using logistic regression 
models and summary statistics. Journal of Business and Economic 
Statistics, 15, 101-108. 


GHOSH, M., and RAO, J.N.K. (1994). Small area estimation: an 
appraisal. Statistical Science, 9, 55-93. 


GONZALES, M.E. (1973). Use and evaluation of synthetic 
estimation. Proceedings of the Social Statistics Section, American 
Statistical Association, 33-36. 


KISH, L. (1965). Survey Sampling. New York: John Wiley & Sons Inc. 


LAIRD, N.M. (1978). Empirical Bayes methods for two-way 
contingency tables. Biometrika, 65, 581-590. 


LAIRD, N.M., and LOUIS, T.A. (1987). Empirical Bayes confidence 
intervals based on bootstrap samples. Journal of the American 
Statistical Association, 82, 739-750. 


MacGIBBON, B., and TOMBERLIN, T.J. (1989). Small area 
estimates of proportions via empirical Bayes techniques. Survey 
Methodology, 15, 237-252. 


MALEC, D., SEDRANSK, J., and TOMPKINS, L. (1993). Bayesian 
predictive inference for small areas for binary variables in the 
National Health Interview Survey. In Case Studies in Bayesian 
Statistics, (Eds. C. Gatsonis, J.S. Hodges, R. Kass, and N.D. 
Singpurwalla). New York: Springer Verlag. 




McCULLAGH, P. (1980). Regression models for ordinal data. 
Journal of the Royal Statistical Society, Series B, 42, 109-142. 


PRASAD, N.G.N., and RAO, J.N.K. (1990). On the estimation of 
mean square error of small area predictors. Journal of the 
American Statistical Association, 85, 163-171. 


RIPLEY, B.D., and KIRKLAND, M.D. (1990). Iterative simulation 
methods. Journal of Computational and Applied Mathematics, 31, 
165-172. 


ROYALL, R.M. (1970). On finite population sampling theory under 
certain linear regression models. Biometrika, 74, 1-12. 


STROUD, T.W.F. (1991). Hierarchical Bayes predictive means and 
variances with application to sample survey inference. 
Communications in Statistics, Theory and Methods, 20, 13-36. 


TOMBERLIN, T.J. (1988). Predicting accident frequencies for 
drivers classified by two factors. Journal of the American 
Statistical Association, 83, 309-321. 


UNITED STATES BUREAU OF THE CENSUS (1984). Census of 
the Population, 1950: Public Use Microdata Sample Technical 
Documentation, edited by J.G. Keane, Washington, D.C. 


WONG, G.Y., and MASON, W.M. (1985). The hierarchical logistic 
regression model for multilevel analysis. Journal of the American 
Statistical Association, 80, 513-524. 


ZEGER, S.L., and KARIM, M.R. (1991). Generalized linear models 
with random effects; a Gibbs sampling approach. Journal of the 
American Statistical Association, 86, 79-86. 


Survey Methodology, December 1997 
Vol. 23, No. 2, pp. 127-135 
Statistics Canada 




Poststratification Into Many Categories Using Hierarchical 
Logistic Regression 


ANDREW GELMAN and THOMAS C. LITTLE¹ 


ABSTRACT 


A standard method for correcting for unequal sampling probabilities and nonresponse in sample surveys is poststratification: 
that is, dividing the population into several categories, estimating the distribution of responses in each category, and then 
counting each category in proportion to its size in the population. We consider poststratification as a general framework 
that includes many weighting schemes used in survey analysis (see Little 1993). We construct a hierarchical logistic 
regression model for the mean of a binary response variable conditional on poststratification cells. The hierarchical model 
allows us to fit many more cells than is possible using classical methods, and thus to include much more population-level 
information, while at the same time including all the information used in standard survey sampling inferences. We are thus 
combining the modeling approach often used in small-area estimation with the population information used in 
poststratification. We apply the method to a set of U.S. pre-election polls, poststratified by state as well as the usual 
demographic variables. We evaluate the models graphically by comparing to state-level election outcomes. 


KEY WORDS: Bayesian inference; Election forecasting; Nonresponse; Opinion polls; Sample surveys. 


1. INTRODUCTION 


It is standard practice for weighting in opinion polls to be 
based entirely or primarily on poststratification, which we 
use generally to refer to any estimation scheme that adjusts 
to population totals. The basic approach is to divide the 
population into a number of categories, within each of 
which the survey is analyzed as simple random sampling. 
The poststratification step is to estimate population quan- 
tities by averaging estimates in the categories, counting 
each category in proportion to its size in the population. 
Poststratification categories are typically based on demo- 
graphic characteristics (sex, age, etc.) as well as any varia- 
bles used in stratification. Another level of complication, 
which we do not address here, would occur under cluster 
sampling. 

There is a fundamental difficulty in setting up post- 
stratification categories. It is desirable to divide the 
population into many small categories in order for the 
assumption of simple random sampling within categories to 
be reasonable. But if the number of respondents per 
category is small, it is difficult to accurately estimate the 
average response within each category. For example, if we 
poststratify by sex, ethnicity, age, education, and region of 
the U.S., some cells may be empty in the sample, whereas 
others may have only one or two respondents. 

A general solution to this problem is to model the 
responses conditional on the poststratification variables (see 
Little 1993). For example, the standard approach to 
adjusting for several demographic variables is to rake 
across one-way or two-way margins (i.e., iterative 
proportional fitting, Deming and Stephan 1940), which 
essentially corresponds to poststratification on the complete 
multi-way table, but with a model of the responses, 


conditional on the demographic variables, that sets 
higher-level interactions to zero. Methods based on 
smoothing weights can also be viewed as poststratification, 
with corresponding models on the responses (see Little 
1991). When the poststratification categories follow a 
hierarchical structure (for example, persons within states in 
the U.S.), one can improve efficiency of estimation by 
fitting a hierarchical model (e.g., Lazzeroni and Little 
1997). In the related context of regression estimation, 
Longford (1996) demonstrates the potential for hierarchical 
linear models to improve the precision of small area 
estimates based on sample survey data. 
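
To make the raking step mentioned above concrete, the following is a minimal sketch of iterative proportional fitting on a two-way table. The function name `rake_two_way`, the toy sample counts, and the target margins are illustrative assumptions and are not taken from the surveys analyzed in this paper.

```python
def rake_two_way(table, row_targets, col_targets, n_iter=100):
    """Iterative proportional fitting: rescale a two-way table of
    sample counts until its margins match known population margins."""
    w = [row[:] for row in table]  # work on a copy
    for _ in range(n_iter):
        # Scale each row so that it sums to its target row total.
        for i, row in enumerate(w):
            s = sum(row)
            w[i] = [x * row_targets[i] / s for x in row]
        # Scale each column so that it sums to its target column total.
        for j in range(len(col_targets)):
            s = sum(row[j] for row in w)
            for row in w:
                row[j] *= col_targets[j] / s
    return w


# Hypothetical 2 x 2 table of sample counts (e.g., sex by age group).
sample = [[30.0, 10.0], [20.0, 40.0]]
fitted = rake_two_way(sample, row_targets=[500.0, 500.0],
                      col_targets=[400.0, 600.0])
row_sums = [sum(r) for r in fitted]
col_sums = [sum(r[j] for r in fitted) for j in range(2)]
```

The fitted table reproduces both sets of margins while keeping the interaction structure of the sample, which is the sense in which raking sets higher-level interactions of the weights aside.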

In this paper, we set up a hierarchical logistic regression 
model to be used for poststratification estimates for a binary 
variable. The advantage of the model, compared to standard 
poststratification, is that it allows for the use of many more 
categories, and thus much more detailed population 
information. The practical gains from this method are 
greatest for small subgroups of the population. We apply 
the method to the state-level results of a set of U.S. 
pre-election polls. This example has the nice feature that 
we can check our inferences externally by comparing to 
state-level election outcomes. Details appear in an 
appendix for computing the hierarchical model using an 
approximate EM algorithm. 


2. MODEL 


2.1 Sampling and Poststratification Information 


Consider a partition of the population defined by $R$ categorical 
variables, where the $r$-th variable has $J_r$ levels, for a total 
of $J = \prod_{r=1}^{R} J_r$ categories (cells), which we label $j = 1, \ldots, J$. 
Assume that $N_j$, the number of units in the population in 
category $j$, is known for all $j$. Let $y$ be a binary response of 
interest; label the population mean response in each category 
$j$ as $\pi_j$. Then the overall population mean is 
$\bar{Y} = \sum_j N_j \pi_j / \sum_j N_j$. Assume that the population is large 
enough that we can ignore all finite-population corrections. 
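
As a small numerical illustration of the population mean formula, the sketch below weights cell means by population cell sizes and, for contrast, by sample cell sizes. All numbers and the function name `poststratified_mean` are hypothetical.

```python
def poststratified_mean(cell_sizes, cell_means):
    """Weighted mean of cell means: sum_j N_j * pi_j / sum_j N_j."""
    total = sum(cell_sizes)
    return sum(n * m for n, m in zip(cell_sizes, cell_means)) / total


N = [500, 300, 200]    # hypothetical population cell sizes N_j
pi = [0.6, 0.5, 0.2]   # hypothetical cell means pi_j
n = [10, 10, 80]       # hypothetical sample sizes (third cell oversampled)

y_bar_population = poststratified_mean(N, pi)  # cells counted by population size
y_bar_sample = poststratified_mean(n, pi)      # cells counted by sample size
```

With these toy numbers the two weightings give clearly different answers, which is the discrepancy that poststratification is meant to correct when the sample composition differs from the population.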

A sample survey is now conducted in order to estimate $\bar{Y}$ 
(and perhaps some other combinations of the $\pi_j$’s). For 
each $j$, let $n_j$ be the number of units in category $j$ in the 
sample. Conditional on the $R$ explanatory variables, assume 
that nonresponse is ignorable (Rubin 1976). Thus, the $R$ 
variables should include all information used to construct 
survey weights, as well as any other variables that might be 
informative about y. 

For the example we shall consider in Section 3, we 
categorize the population of adults in the 48 contiguous 
U.S. states by $R = 5$ variables: state of residence, sex, 
ethnicity, age, and education, with 
$(J_1, \ldots, J_5) = (48, 2, 2, 4, 4)$. (Ethnicity is discretized 
into 2 categories, and age and education into 4 categories 
each, as described in Section 3.1.) 
The $J = 3{,}072$ categories range from “Alabama, male, 
black, 18-29, not high school graduate” to “Wyoming, 
female, nonblack, 65+, college graduate,” and, from the 
U.S. Census, we have good estimates of $N_j$ in each of these 
categories. We shall consider population estimates 
(summing over all 3,072 categories) and also estimates 
within individual states (separately summing over the 
64 categories for each state). It is impossible for a 
reasonably-sized sample survey to allow independent 
estimates of the mean responses $\pi_j$ for each category $j$ (in 
fact, the vast majority of categories will be empty or contain 
just one respondent), and so it is necessary to model the 
$\pi_j$’s in order to poststratify and thus make use of the known 
category sizes $N_j$. The (potential) advantage of 
poststratification is to correct for differential nonresponse rates 
among the categories. 
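
The cell structure just described can be enumerated directly. The sketch below is only an illustration: the state indices are placeholders, and the level labels are taken from the variable descriptions in this paper.

```python
from itertools import product

levels = {
    "state": range(48),  # placeholder indices for the 48 contiguous states
    "sex": ["male", "female"],
    "ethnicity": ["black", "nonblack"],
    "age": ["18-29", "30-44", "45-64", "65+"],
    "education": ["not high school grad", "high school grad",
                  "some college", "college grad"],
}

# Every combination of levels is one poststratification cell.
cells = list(product(*levels.values()))
J = len(cells)

# Cells belonging to a single state (here the state with index 0).
cells_in_first_state = [c for c in cells if c[0] == 0]
```

Counting the cells this way reproduces the totals quoted in the text: 3,072 cells overall and 64 demographic cells within each state.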


2.2 Regression Modelling in the Context of 
Poststratification 


One can set up a logistic regression model for the 
probability $\pi_j$ of a “yes” for respondents in category $j$: 


$$\mathrm{logit}(\pi_j) = X_j \beta, \qquad (1)$$ 


where $X$ is a matrix of indicator variables, and $X_j$ is the 
$j$-th row of $X$. If we were to assume a uniform prior 
distribution on $\beta$, then Bayesian inference, for different 
choices of $X$, under this model corresponds closely to 
various classical weighting schemes. These correspondences, 
which we present below, are general and rely on the 
linearity of the assumed model (that is, $X_j \beta$ in (1)). (In the 
case of binary data, which we are considering in this paper, 
the classical and uniform-prior-Bayesian estimates are not 
identical, because of the nonlinear logistic transformation 
in (1), but for large samples the differences are minor.) 

The following models correspond to the most commonly 
used classical poststratification estimates. 

- Setting $X$ to the $J \times J$ identity matrix corresponds to 
weighting each unit in cell $j$ by $N_j/n_j$; that is, simple 
poststratification. This method is well known to work 
well only if the $n_j$’s are reasonably large (and it will not 
work at all if $n_j = 0$ for any $j$). 

- If we set $X$ to the $J \times (\sum_r J_r)$ matrix of indicators for 
each individual variable, then the estimate of $\bar{Y}$ corresponds 
approximately to that obtained by raking across all 
$R$ one-way margins. 

- Including various interactions in $X$ corresponds to 
including these same interactions in the raking. To put it 
most generally, assuming “structure” of any kind in $X$ 
corresponds to pooling the poststratification across cells 
in some way. 

- Including no explanatory variables in the model (that is, 
letting $X$ be simply a vector of 1’s) leads to the sample 
mean estimate $\bar{y}$. 

See Holt and Smith (1979) and Little (1993) for more 
discussion of the relation between weighting estimates and 
poststratification. 
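
The first two choices of $X$ in the list above can be written out for a toy case with two variables. This is a sketch under assumed names (`identity_design`, `main_effects_design`); the two variables with 2 and 3 levels are hypothetical.

```python
from itertools import product


def identity_design(J):
    """J x J identity matrix: one free parameter per cell
    (simple poststratification)."""
    return [[1 if a == b else 0 for b in range(J)] for a in range(J)]


def main_effects_design(levels):
    """J x sum(J_r) matrix of indicators, one block of columns per
    variable (corresponds approximately to raking on one-way margins)."""
    offsets = [sum(levels[:r]) for r in range(len(levels))]
    X = []
    for cell in product(*[range(J_r) for J_r in levels]):
        row = [0] * sum(levels)
        for r, level in enumerate(cell):
            row[offsets[r] + level] = 1
        X.append(row)
    return X


levels = [2, 3]                      # hypothetical J_1 = 2, J_2 = 3
X_simple = identity_design(2 * 3)    # 6 x 6
X_raking = main_effects_design(levels)  # 6 x 5
```

Each row of `X_raking` has exactly one indicator switched on per variable. Because each variable's indicators sum to one, this full set is not of full column rank, so a fitted regression would typically drop or constrain one level per variable.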


2.3 Hierarchical Regression Modelling for Partial 
Pooling 


When the number of cells is large, none of the above 
options makes efficient use of the information provided by 
the categories (for example, simple poststratification gives 
estimates that are too variable, but if we exclude explana- 
tory variables with many categories, we are discarding 
important information). Instead, we allow partial pooling 
across cells by setting up a mixed-effects model (see, e.g., 
Clayton 1996). We write the vector $\beta$ as $(\alpha, \gamma_1, \ldots, \gamma_L)$, 
where $\alpha$ is a subvector of unpooled coefficients and each 
$\gamma_l$, for $l = 1, \ldots, L$, is a subvector of coefficients $(\gamma_{kl})$ to 
which we fit a hierarchical model: 


$$\gamma_{kl} \overset{\mathrm{ind}}{\sim} N(0, \tau_l^2), \qquad k = 1, \ldots, K_l.$$ 


Setting $\tau_l$ to zero corresponds to excluding a set of 
variables; setting $\tau_l$ to $\infty$ corresponds to a noninformative 
prior distribution on the $\gamma_{kl}$ parameters. 

Given the responses $y_i$ in categories $j$, we construct an 
$n \times J$ categorization matrix $C$, for which $C_{ij} = 1$ if 
respondent $i$ is in cell $j$ and $C_{ij} = 0$ otherwise. Let $Z = CX$. 
The model (1) then can 
be written in the standard form of a hierarchical logistic 
regression model as 


$$y_i \sim \mathrm{Bernoulli}(p_i)$$ 
$$\mathrm{logit}(p_i) = (Z\beta)_i$$ 
$$\beta \sim N(0, \Sigma_\beta),$$ 

where $\Sigma_\beta^{-1}$ is a diagonal matrix with 0 for each element of $\alpha$, 
followed by $\tau_l^{-2}$ for each element of $\gamma_l$, for each $l$. We use 
the notation $p_i$ for the probability corresponding to the unit 
$i$, as distinguished from $\pi_j$, the aggregate probability 
corresponding to the category $j$. See Nordberg (1989) and 
Belin, Diffendal, Mack, Rubin, Schafer and Zaslavsky 
(1993) for general discussions of hierarchical logistic 
regression models for survey data. 
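
The pieces of this formulation can be assembled mechanically. The sketch below builds the categorization matrix, the respondent-level design matrix $Z = CX$, and the diagonal of the prior precision matrix for a toy problem; the helper names and all numbers are hypothetical.

```python
def categorization_matrix(cell_of_respondent, J):
    """n x J matrix with a 1 in column j when respondent i is in cell j."""
    return [[1 if j == c else 0 for j in range(J)]
            for c in cell_of_respondent]


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def prior_precision_diagonal(n_alpha, batch_sizes, taus):
    """0 for each unpooled coefficient, then tau_l**-2 repeated K_l times
    for each hierarchical batch l."""
    diag = [0.0] * n_alpha
    for K_l, tau_l in zip(batch_sizes, taus):
        diag += [tau_l ** -2] * K_l
    return diag


# Toy problem: 2 cells; X has an intercept plus 2 group indicators.
X = [[1, 1, 0],
     [1, 0, 1]]
cell_of_respondent = [0, 0, 1]   # three respondents, two in the first cell
C = categorization_matrix(cell_of_respondent, J=2)
Z = matmul(C, X)                 # each respondent inherits the row of its cell
precision = prior_precision_diagonal(n_alpha=1, batch_sizes=[2], taus=[0.5])
```

Multiplying by `C` simply copies the row of `X` for each respondent's cell, and the zero in `precision` encodes the flat prior on the unpooled coefficient.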


2.4 Inference Under the Model 


To perform inferences about population quantities, we 
use the following empirical Bayes strategy: first, estimate 
the hyperparameters $\tau_l$, given the data $y$; second, perform 
Bayesian inference for the regression coefficients $\beta$, given 
$y$ and the estimated $\tau_l$’s; third, compute inferences for the 
vector of cell means $\pi = \mathrm{logit}^{-1}(X\beta)$; fourth, compute 
inferences for population quantities by summing $N_j \pi_j$’s. 
We view this approach as an approximation to the full 
Bayesian analysis, which averages over the parameters $\tau_l$. 
The two approaches will differ the most when components $\tau_l$ 
are imprecisely estimated or are indistinguishable from 0 
(see, for example, Gelman, Carlin, Stern and Rubin (1995), 
Section 5.5). In the example we consider here, this is not a 
problem because the various components are clearly esti- 
mated to be different from 0. If this were not the case, it 
would probably be worth putting in the additional 
programming effort for a full Bayes analysis. The focus of 
this paper, however, is on the effectiveness of combining 
hierarchical modeling with poststratification, not on the 
relatively minor technical differences between Bayes and 
empirical Bayes analyses. 
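
The third and fourth steps of the strategy above reduce to a few lines once a coefficient vector is available. The sketch below uses a single hypothetical value of $\beta$ rather than a posterior distribution, and a hypothetical two-cell design.

```python
import math


def inv_logit(x):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-x))


def cell_means(X, beta):
    """Step three: pi = logit^{-1}(X beta), one value per cell."""
    return [inv_logit(sum(x * b for x, b in zip(row, beta))) for row in X]


def poststratify(N, pi):
    """Step four: population-weighted average of the cell means."""
    return sum(n * p for n, p in zip(N, pi)) / sum(N)


X = [[1, 0],
     [1, 1]]                     # hypothetical design: intercept + one indicator
beta = [0.0, math.log(3.0)]      # hypothetical coefficients
N = [600, 400]                   # hypothetical population cell sizes

pi = cell_means(X, beta)         # cell means 0.5 and 0.75 (up to rounding)
estimate = poststratify(N, pi)   # (300 + 300) / 1000
```

In a full analysis these two steps would be repeated for every posterior draw of $\beta$, giving a distribution for each population quantity.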

The shrinkage of the cell estimates comes in the second 
step, and the amount of shrinkage depends both on the 
sample sizes $n_j$ and the data $y$. More shrinkage occurs for 
smaller values of $n_j$ and for cells whose observed responses 
are far from the 
predictions based on the logistic regression model. In 
addition, more shrinkage occurs if the parameters $\tau_l$ are 
small. A batch of coefficients $\gamma_l$ with little predictive 
power will be shrunk toward zero in the estimation, because 
$\tau_l$ will be estimated to have a small value. This is how we 
can include a large number of coefficients in the hierarchical 
model without the estimates of population quantities 
becoming too variable. 
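
The dependence of shrinkage on $\tau_l$ can be seen in a small experiment. The sketch below finds the posterior mode of a random-intercept logistic model by plain gradient ascent with $\tau$ held fixed; this is a stand-in chosen for brevity, not the approximate EM algorithm of the appendix, and the counts, step size, and function name are all hypothetical.

```python
import math


def fit_random_intercepts(successes, trials, tau, step=0.005, n_iter=20000):
    """Posterior mode of logit(p_k) = alpha + gamma_k, with
    gamma_k ~ N(0, tau^2) and a flat prior on alpha, by gradient ascent."""
    alpha = 0.0
    gamma = [0.0] * len(trials)
    for _ in range(n_iter):
        # Residuals y_k - n_k * p_k at the current parameter values.
        resid = []
        for y, n, g in zip(successes, trials, gamma):
            p = 1.0 / (1.0 + math.exp(-(alpha + g)))
            resid.append(y - n * p)
        # Gradient of the log posterior: the likelihood term for alpha,
        # the likelihood term minus the prior pull toward zero for gamma_k.
        alpha += step * sum(resid)
        gamma = [g + step * (r - g / tau ** 2) for g, r in zip(gamma, resid)]
    return alpha, gamma


successes = [1, 10, 40]   # hypothetical "yes" counts in three groups
trials = [5, 20, 50]      # hypothetical group sample sizes

_, gamma_tight = fit_random_intercepts(successes, trials, tau=0.1)
_, gamma_loose = fit_random_intercepts(successes, trials, tau=2.0)
```

With the small value of $\tau$ the group effects are pulled almost entirely to zero, whereas with the larger value they stay close to the group-specific logits.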


3. APPLICATION: BREAKING DOWN 
NATIONAL SURVEYS BY STATE 


3.1 Survey Data 


We apply the above methodology to state-by-state results 
from seven national opinion polls of registered voters 
conducted by the CBS television network during the two 
weeks immediately preceding the 1988 U.S. Presidential 
election. To follow our general notation, we assign $y_i = 1$ 
to supporters of Bush and $y_i = 0$ to supporters of Dukakis; 
we discard the respondents who expressed no opinion 
(about 15% of the total; we follow standard practice and 
count respondents who “lean” toward one of the candidates 
as full supporters). Since no data were collected from 
Hawaii and Alaska, only the 48 contiguous states are 
included in the model. Washington, D.C., although 
included in the surveys, was excluded from this analysis 
because its voting preferences are so different from the 
other states that a generalized linear model that fit the 48 
states would not fit D.C. well, and as a result, the data from 
D.C. would unduly influence the results for the states. Since 
there are few observations for the smaller states and the 
between-poll variation in the estimated support for Bush is 
within binomial sampling variability (as measured by a $\chi^2$ 
test of equality of the proportions of support for Bush in the 
seven polls), we combine the data from all the polls. 
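
A check of this kind can be sketched as a Pearson chi-squared test of equal proportions across polls. The counts below are hypothetical and are not the CBS data; the critical value 12.592 is the 95th percentile of the chi-squared distribution with 6 degrees of freedom.

```python
def chi_square_homogeneity(successes, trials):
    """Pearson chi-squared statistic for equality of several binomial
    proportions, with its degrees of freedom."""
    p_pooled = sum(successes) / sum(trials)
    stat = 0.0
    for y, n in zip(successes, trials):
        expected = n * p_pooled
        # Combines the "yes" and "no" cells of each poll into one term.
        stat += (y - expected) ** 2 / (expected * (1.0 - p_pooled))
    return stat, len(trials) - 1


bush = [110, 112, 108, 111, 109, 113, 107]   # hypothetical Bush supporters
total = [200] * 7                            # hypothetical poll sizes

stat, df = chi_square_homogeneity(bush, total)
CRITICAL_5PCT_DF6 = 12.592
polls_consistent = stat < CRITICAL_5PCT_DF6
```

When the statistic falls below the critical value, the between-poll variation is no larger than binomial sampling would produce, which is the condition under which pooling the polls is reasonable.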

CBS creates survey weights by raking on the following 
variables, with default classifications for item nonresponse 
shown in brackets: 


Census region: Northeast, South, North Central, West 
sex: male, female 
ethnicity: black, [white/other] 
age: 18-29, 30-44, [45-64], 65+ 
education: not high school grad, [high school grad], some college, college grad. 


The raking includes all main effects plus the interactions 
of sex x ethnicity and age x education. We include all 
these variables as fixed effects in our logistic regression 
model, excluding from our analysis the relatively few 
respondents with nonresponse in any of the demographic 
variables. The CBS weights also correct for number of 
telephone lines and number of adults in household, which 
affect sampling probabilities; these have minor effects on 
estimates for Presidential preference (see Little 1996, 
chapter 3), and we do not include them in our model. 
Further details of the CBS survey methodology and 
adjustment appear in Voss, Gelman, and King (1995). 

Our model goes beyond the CBS analysis by including 
indicators for the 48 states as random effects, clustered into 
four batches corresponding to the four census regions. We 
check the performance of the model by comparing estimates 
for each state to the observed Presidential election. 
(Opinion polls just before the election are reliable indicators 
of the actual election outcome; see, e.g., Gelman and King 
1993.) We also compare the stability of estimates based on 
different polls over a short period of time. 


3.2 Population Data for Poststratification 


In order to poststratify on all the variables listed above, 
along with state, we need the joint population distribution 
of the demographic variables within each state: that is, 
population totals $N_j$ for each of the $2 \times 2 \times 4 \times 4 \times 48$ cells of 
sex x ethnicity x age x education x state. Since the target population 
is registered voters, we should use the population distri- 
bution of registered voters. As an approximation to that 
distribution we use the crosstabulations available in the 
Public Use Microdata Sample (PUMS) data for all citizens of 
age 18 and over. The PUMS data contain records for 5% of 
the housing units in the U.S. and the persons in them, 
including over 12 million persons and over 5 million 
housing units. These data are a stratified sample of the 
approximately 15.9% of housing units that received 
long-form questionnaires in the 1990 Census. Persons in 
institutions and other group quarters are also included in the 
sample. Weights are given for both the housing unit and 
persons within the unit based on sampling probabilities and 
adjustment to Census totals for variables included in the 
short-form questionnaire. We use the weighted PUMS data 
to estimate $N_j$ for each poststratification category and 
ignore sampling error in these numbers. The weighted 
PUMS numbers are very similar to the poststratification 
numbers used by CBS in their raking (see Little 1996, 
chapter 3). 


3.3. Results 


We present results for four methods applied to the 
combined data from the seven surveys: 


1. Classical estimate based on raking by demographic 
variables (region, sex, ethnicity, age, education, sex x 
ethnicity, and age x education). This is very close to the 
weighting method used by CBS. For estimates of results 
by states, we perform weighted averages within each 
state, using the weights obtained by the raking. 

2. Regression estimate using the demographic variables 
and also indicators for the states, with no hierarchical 
model (i.e., “fixed-effects” regression). This is very 
similar to using iterative proportional fitting to rake on 
states as well as demographics. The state-by-state 
estimates from this model should improve upon those 
obtained by raking on demographics because the 
estimates of $\pi_j$’s are weighted by the population 
numbers $N_j$ rather than the sample numbers $n_j$ within 
each state. 

3. Regression estimate using only the demographic 
variables, with the state effects set to zero. This model 
allows the average responses within states to differ only 
because of demographic variation; to the extent that the 
demographics do not explain all the variation in opinion, 
the model should underestimate the variability between 
states. 

4. Regression estimate using the demographic variables, 
with the 48 state effects estimated with a hierarchical 
model (in the notation of Section 2, $L = 4$ and 
$(K_1, K_2, K_3, K_4) = (12, 13, 12, 11)$). We expect this model to 
perform best, both because of the flexibility of the 
hierarchical regression model and because the 
poststratification uses the population numbers $N_j$. 


We fit each of the regression models to the survey data, 
obtain posterior simulation draws for each coefficient 
(conditional on the estimated $\tau_1, \tau_2, \tau_3, \tau_4$), and reweight 
based on the PUMS data to obtain poststratified estimates 
for the proportion of registered voters in each state who 
support Bush for President. 
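
The reweighting of simulation draws can be sketched as follows: each draw of the cell probabilities within a state is turned into a state estimate, and the draws are then summarized by their median and interquartile range. The five draws and two cell sizes below are hypothetical; a real analysis would use many draws over the 64 cells of each state.

```python
import statistics


def state_estimates(draws, N):
    """Poststratified state estimate for each posterior draw of the
    cell probabilities."""
    total = sum(N)
    return [sum(n * p for n, p in zip(N, pi)) / total for pi in draws]


N = [100, 300]   # hypothetical population sizes of two cells in one state
draws = [(0.50, 0.60), (0.40, 0.70), (0.60, 0.50),
         (0.55, 0.65), (0.45, 0.55)]   # hypothetical draws of (pi_1, pi_2)

estimates = state_estimates(draws, N)
median = statistics.median(estimates)
q1, _, q3 = statistics.quantiles(estimates, n=4)
iqr = q3 - q1   # width of the central 50% interval
```

The median and interquartile range computed this way correspond to the two summaries reported for each model-based estimate in Table 1.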

Table 1 presents the raking estimate and the posterior 
medians and interquartile ranges for the three models, along 
with data on the survey responses and the actual election 
outcome. Table 2 gives the nationwide and mean absolute 
statewide prediction errors for the raking and the three 
models. The four methods give almost identical results at 
the national level; the real gain from the model-based 
estimates occurs in estimating the individual states. The 
reduction in mean absolute prediction error from about 6% 
to 5% can be attributed to using the poststratification 
information, with the further reduction to 3.5% attributable 
to the hierarchical modeling. In addition, the last two lines 
of Table 2 show that the uncertainty estimates from the 
hierarchical model are short and relatively well calibrated 
(slightly less than half of the true values fall inside the 50% 
intervals, which is reasonable since these intervals account 
only for sampling error and not for nonsampling errors and 
changes in opinion). 
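
The two summaries used here, mean absolute error and coverage of the 50% intervals, can be computed as in the sketch below. The five rows are copied from Table 1 for illustration only, so the results will not match the 48-state summaries in Table 2. Treating each 50% interval as the median plus or minus half the interquartile range is an assumption made for this sketch.

```python
# (state, election result, raking estimate,
#  hierarchical-model median, hierarchical-model IQR), from Table 1
rows = [
    ("AL", 0.60, 0.67, 0.62, 0.05),
    ("CA", 0.52, 0.53, 0.55, 0.02),
    ("NV", 0.61, 0.80, 0.60, 0.09),
    ("RI", 0.44, 0.29, 0.34, 0.06),
    ("WA", 0.49, 0.41, 0.48, 0.04),
]


def mean_abs_error(truth, estimates):
    """Average absolute difference between estimates and outcomes."""
    return sum(abs(t - e) for t, e in zip(truth, estimates)) / len(truth)


truth = [r[1] for r in rows]
mae_raking = mean_abs_error(truth, [r[2] for r in rows])
mae_hierarchical = mean_abs_error(truth, [r[3] for r in rows])

# Count outcomes inside the (assumed symmetric) central 50% interval.
covered = sum(1 for _, t, _, m, width in rows if abs(t - m) <= width / 2)
```

Even in this small subset the hierarchical estimates have a much smaller mean absolute error than the raking estimates, in line with the pattern reported for all 48 states.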

Figure 1 plots, by state, the actual election outcomes vs. 
the raking estimates and the posterior medians for the three 
models. As one would expect, the hierarchical model 
reduces variance, and thus estimation error, by shrinkage. 
Although the four methods correct the bias of the nation- 
wide estimate by about the same amount, they act differently 
on the individual states, with the hierarchical model 
performing best. Figure 2 compares the prediction errors for 
the hierarchical and raking estimates for the states. 

Interestingly, the hierarchical model does not seem to 
shrink the data enough to the nationwide mean: we can tell 
this because, in Figure 1d, the actual election outcome is 
higher than predicted for low-predicted values, and lower 
than predicted for high-predicted values. Undershrinkage 
means that the estimated parameters $\tau_l$ are probably higher 
than their true values, which could be caused by a pattern of 
nonignorable nonresponse that varies between states so that 
observed variability in the state proportions is caused by 
varying nonresponse patterns as well as actual variation in 
average opinions (see Little and Gelman 1996, for a 
discussion of this example and Krieger and Pfeffermann 
1992, for a more general treatment). The undershrinkage 
could be quantified by comparing the estimated to the 
optimal level of shrinkage, but this comparison can only be 
made after the true values are observed. 

It is also possible to compare the models by fitting each 
separately to each survey and examining the stability of 
estimates over a short period of time. This would be a more 
reasonable way to study the models in the common situation 
that the true population means never become known. 
Figure 3 displays, for each of our seven surveys, the 
estimates from raking and from the hierarchical model. 
(When modeling the surveys individually, we fit a common 
hierarchical variance for all 48 states because there was not 
enough data to obtain reliable maximum likelihood 
estimates for the four regions separately from the data in 
each poll.) Results are shown for the entire United States 
and for three representative states: California (a large state), 
Washington (mid-sized), and Nevada (small). For 
convenience, the plots also show the estimates based on the 
seven surveys pooled and the actual election outcomes. For 
all the individual states, the hierarchical estimate is less 
variable over time than the raking estimate. The pattern is 
clearest in Nevada, where the sample size for the individual 
surveys was so low that the raking estimate degenerated to 
0 or 1 in most cases, but the better performance of the 
hierarchical model is clear in the other states as well. For 
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Table 1 
By state: election results (proportion of the two-party vote in 1988 received by Bush); survey data (unweighted mean and sample size) from 
the combined surveys; raking estimate using CBS variables; and posterior median (and interquartile range; that is, width of the central 50% 
uncertainty interval) of poststratified estimates based on state effects unsmoothed, set to zero, and fit by a hierarchical model. 
Estimates are labelled 1, 2, 3, 4 corresponding to the descriptions in Section 3.3. 


Poststratification estimates (and IQRs) 


SS SS a ae 
AL 0.60 134 0.72 0.67 0.63 (0.05) 0.56 (0.01) 0.62 (0.05) 
AR 0.57 86 0.57 0.53 0.53 (0.06) 0.60 (0.01) 0.55 (0.06) 
AZ 0.61 141 0.62 0.61 0.62 (0.05) 0.56 (0.02) 0.61 (0.05) 
CA 0.52 1075 0.57 0.53 0.55 (0.02) 0.53 (0.01) 0.55 (0.02) 
CO 0.54 126 0.59 0.59 0.58 (0.06) 0.57 (0.01) 0.57 (0.05) 
Cr 0.53 103 0.53 0.55 0.52 (0.06) 0.49 (0.02) 0.51 (0.06) 
DE 0.56 30 0.40 0.37 0.42 (0.11) 0.60 (0.01) 0.52 (0.08) 
lnb, 0.61 553 0.64 0.62 0.61 (0.03) 0.62 (0.01) 0.61 (0.03) 
GA 0.60 211 0.62 0.58 0.56 (0.04) 0.56 (0.01) 0.56 (0.04) 
IA 0.45 102 0.38 0.38 0.38 (0.06) 0.59 (0.01) 0.41 (0.06) 
ID 0.63 31 0.52 0.58 0.52 (0.12) 0.59 (0.02) 0.55 (0.08) 
iN, 0.51 429 0.55 0.52 0.53 (0.03) 0.52 (0.01) 0.52 (0.03) 
IN 0.60 Dil O75 0.73 0.74 (0.04) 0.56 (0.01) 0.72 (0.04) 
KS 0.57 105 0.72 0.71 0.71 (0.06) 0.57 (0.01) 0.68 (0.05) 
KY 0.56 146 0.57 0.53 0.56 (0.05) 0.64 (0.01) 0.57 (0.05) 
LA 0.55 153 0.62 0.60 0.61 (0.05) 0.54 (0.01) 0.59 (0.04) 
MA 0.46 277 0.47 0.41 0.46 (0.04) 0.50 (0.02) 0.47 (0.04) 
MD 0.51 207 0.52 0.50 0.49 (0.04) 0.56 (0.01) 0.50 (0.04) 
ME 0.56 44 0.52 0.52 0.55 (0.10) 0.52 (0.02) 0.54 (0.08) 
MI 0.54 399 0.58 0.55 0.57 (0.03) 0.54 (0.01) 0.57 (0.03) 
MN 0.46 210 0.54 0.53 0.53 (0.05) 0.59 (0.01) 0.53 (0.04) 
MO 0.52 235 0.46 0.43 0.46 (0.04) 0.55 (0.01) 0.47 (0.04) 
MS 0.61 170 0.69 0.70 0.65 (0.04) 0.53 (0.01) 0.63 (0.04) 
MT 0.53 Bil 0.39 0.40 0.40 (0.12) 0.58 (0.02) 0.50 (0.09) 
NC 0.58 239 0.59 0.60 0.55 (0.04) 0.58 (0.01) 0.55 (0.04) 
ND 0.57 54 0.56 0.56 0.55 (0.09) 0.58 (0.01) 0.56 (0.08) 
NE 0.61 90 0.58 0.60 0.56 (0.07) 0.58 (0.01) 0.56 (0.06) 
NH 0.63 20 0.70 0.68 0.73 (0.13) 0.53 (0.02) 0.61 (0.10) 
NJ 0.57 301 0.57 0.60 0.53 (0.04) 0.46 (0.01) 0.53 (0.03) 
NM 0.53 87 0.55 0.54 0.57 (0.07) 0.54 (0.02) 0.56 (0.06) 
NV 0.61 19 0.68 0.80 0.67 (0.13) 0.56 (0.02) 0.60 (0.09) 
NY 0.48 639 0.42 0.37 0.41 (0.03) 0.45 (0.01) 0.41 (0.02) 
OH 0.55 454 0.62 0.63 0.58 (0.03) 0.55 (0.01) 0.58 (0.03) 
OK 0.58 93 0.57 0.62 0.59 (0.07) 0.63 (0.01) 0.60 (0.06) 
OR 0.48 ii 0.50 0.47 0.50 (0.06) 0.58 (0.02) 0.52 (0.06) 
PA 0.51 431 0.54 0.54 0.52 (0.03) 0.48 (0.02) 0.52 (0.03) 
RI 0.44 65 0.28 0.29 0.27 (0.07) 0.50 (0.02) 0.34 (0.06) 
SC 0.62 151 0.70 0.67 0.66 (0.05) 0.55 (0.01) 0.64 (0.04) 
SD 0.53 52 0.54 0.51 0.53 (0.09) 0.58 (0.01) 0.54 (0.08) 
T™ 0.58 252 0.68 0.69 0.66 (0.04) 0.60 (0.01) 0.65 (0.03) 
TX 0.56 594 0.58 0.52 0.56 (0.03) 0.60 (0.01) 0.56 (0.02) 
UT 0.67 61 0.80 0.85 0.79 (0.07) 0.60 (0.02) 0.72 (0.06) 
VA 0.60 255 0.69 0.72 0.67 (0.04) 0.59 (0.01) 0.66 (0.03) 
Aa 0.52 12 0.54 0.58 0.60 (0.19) 0.53 (0.02) 0.55 (0.11) 
WA 0.49 269 0.47 0.41 0.46 (0.04) 0.58 (0.01) 0.48 (0.04) 
WI 0.48 264 0.49 0.53 0.48 (0.04) 0.57 (0.01) 0.49 (0.04) 


WV 0.48 79 0.48 0.52 0.48 (0.07) 0.65 (0.01) 0.53 (0.06) 
WY 0.61 13 0.50 0.36 0.59 (0.17) 0.59 (0.02) 0.59 (0.10) 
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Table 2 
Summary statistics for raw mean of responses, raking estimate, and three poststratified estimates from the combined surveys. Summaries 
given are the estimated mean of the 48 state vote proportions weighted by state voter turnout (thus, estimated national popular vote 
proportion for Bush excluding Alaska, Hawaii, and the District of Columbia); the mean absolute error of the 48 state estimates; the average 
width of the 50% intervals for the states; and the number of the 48 states whose true values fall within the 50% intervals. 


                                            Election  Unweighted  Raking    State effects  State effects  Hierarchical
                                            result    mean        estimate  unsmoothed     set to 0       model
Mean of national popular vote               0.539     0.568       0.549     0.548          0.547          0.550
Mean absolute error of states               -         0.056       0.066     0.049          0.048          0.035
Average width of 50% intervals              -         -           -         (0.069)        (0.016)        (0.057)
Number of states contained in 50% interval  -         -           -         18             3              20
Figure 1. Election result by state vs. posterior median estimate for (a) raking on demographics, (b) regression model 
including state indicators with no hierarchical model, (c) regression model setting state effects to zero, 
(d) regression model with hierarchical model for state effects. 


Figure 2. Scatterplot of prediction errors by state for the hierarchical model vs. the raking
estimate. The errors of the hierarchical model are lower for most states.

Figure 3. Bush support estimated separately from seven individual polls taken shortly before the election for (a) the entire U.S.
(excluding Alaska, Hawaii, and the District of Columbia), (b) a large state (California), (c) a medium-sized state (Washington), and
(d) a small state (Nevada). Each plot shows the raking estimates as a dotted line and the estimates from the hierarchical model as a solid
line, with error bars indicating 50% confidence bounds for the raking and 50% posterior intervals for the model-based estimates.
The polls were taken between nine and two days before the election. Estimates based on the combined surveys are shown at time
“-1”, and the actual election result is shown at time “0” on each plot. (Axes: Bush support vs. days to election.)

example, it was not reasonable to assign Bush only 46% of
the support in California (in the poll 3 days before the
election) or only 30% of the support in the state of
Washington. For the United States as a whole, however,
the two estimates are quite similar (in fact, when all seven
polls are combined, the raking estimate performs very
slightly better), indicating once again that the benefits from
the modelling approach appear when studying subsets of
the population.

The results for Washington have the surprising property
that the regression estimate based on the combined surveys
(shown at time “-1” on the graph) is lower than the seven
estimates from the original surveys. This occurs because
the data from the combined surveys show that the state of
Washington supports Bush less than would be predicted
merely by controlling for the demographic covariates (that
prediction would be the estimate for Washington from the
model with state effects set to zero, which from Table 1 is
0.58). But none of the individual surveys, taken alone, had
enough data to make a convincing case that Washington 
was so far from the national mean, and so the Bayes 
estimate shrank their estimates to a greater extent. This
behavior, while it may seem strange at first, is in fact
appropriate: with a smaller survey, there is less information 
about the individual poststratification categories, and the 
model-based estimate produces an estimate for each 
category that is closer to the sample mean. When all seven 
surveys are combined, more information is available, and 
the model relies more strongly on the data in each category. 
This is how the Bayes procedure essentially balances the 
concerns of poststratifying on too few or too many 
categories. 
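
The balance described above can be illustrated with a deliberately simplified normal-normal analogue of the hierarchical model. This is only a sketch under stated assumptions: the model in this paper is a hierarchical logistic regression, and the function name `shrunk_estimate` and all numbers below are hypothetical.

```python
def shrunk_estimate(ybar, n, sigma2, mu, tau2):
    """Posterior mean of a group mean under a normal-normal model.

    ybar   : raw mean observed in the group
    n      : sample size in the group
    sigma2 : within-group variance of a single response (assumed known)
    mu     : overall mean toward which estimates are pooled
    tau2   : between-group variance (assumed known)
    Returns (posterior mean, weight placed on the group's own data).
    """
    data_precision = n / sigma2
    prior_precision = 1.0 / tau2
    weight_on_data = data_precision / (data_precision + prior_precision)
    estimate = weight_on_data * ybar + (1.0 - weight_on_data) * mu
    return estimate, weight_on_data


# Hypothetical state with a raw mean of 0.40 and a pooled mean of 0.55.
single_poll = shrunk_estimate(ybar=0.40, n=40, sigma2=0.25, mu=0.55, tau2=0.0025)
combined = shrunk_estimate(ybar=0.40, n=280, sigma2=0.25, mu=0.55, tau2=0.0025)
print(single_poll, combined)
```

With the smaller sample the estimate stays close to the pooled mean; with seven times the data the weight on the group's own mean rises, which is the qualitative pattern described for Washington.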


4. DISCUSSION 


Poststratification is the standard method of correcting for 
unequal probabilities of selection and for nonresponse in 
sample surveys. From the modelling perspective, raking or
poststratification on a set of covariates is closely related to
a regression model of responses conditional on those
covariates, with population quantities estimated by summing
over the known distribution of covariates in the population.
Conditioning on more fully-observed covariates allows one 
to include more information in forming population 
estimates, but it is well known that raking on too large a set 
of covariates yields unacceptably variable inferences. We 
propose a method of poststratification on a large set of 
variables while fitting the resulting regression with a 
hierarchical model, thus harnessing the well-known 
strengths of Bayesian inference for models with large 
numbers of exchangeable parameters. 
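
As a minimal sketch of the final step described above, summing over the known population distribution of the covariates amounts to a weighted average of category-level estimates. The function name and the three-category numbers are hypothetical.

```python
def poststratified_mean(cell_means, cell_counts):
    """Population mean from per-category estimates and known category sizes."""
    total = sum(cell_counts)
    return sum(n * m for n, m in zip(cell_counts, cell_means)) / total


# Three hypothetical poststratification categories.
estimate = poststratified_mean(cell_means=[0.6, 0.4, 0.5],
                               cell_counts=[100, 300, 600])
print(estimate)
```

Raking and the regression-based estimates differ in how the category-level means are obtained, not in this final summation.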

The Bayesian poststratification is most useful for 
estimation in subsets of the population (e.g., individual 
states in the U.S. polls) for which sample sizes are small. 
A related area in which modeling should be effective is in 
combining surveys conducted by different organizations, 
modeling conditional on all variables that might affect 
nonresponse in either survey. In addition, the methods in 
this paper can obviously be applied to continuous responses 
by replacing logistic regressions by other generalized linear 
models. 
Our purpose in Bayesian modeling is not to fit a 
subjectively “true” model to the data or the underlying 
responses, but rather to estimate with reasonable accuracy 
the average response conditional on a large set of 
fully-observed covariates. More accurate models of the 
responses should allow more accurate inferences — but even 
the simple exchangeable mixed effects model we have fit, 
with hyperparameters estimated from the data, should 
perform better than the extremes of the fixed effects model 
or setting coefficients to zero. Ultimately, the goal of 
probability modeling and Bayesian inference in a sample 
survey context is to allow one to make use of abundant 
poststratification information (e.g., census data classified by 
sex, ethnicity, age, education, and state) to adjust a 
relatively small sample survey. 

Difficulties with modeling approaches such as ours 
could arise in several ways. If one adjusts to a large number 
of categories using too weak a model (such as the model 
with unsmoothed state effects), the resulting estimates can 
be too variable. If the population distributions of the 
variables used in the poststratification are not available (for 
example, adjusting to a variable that is not measured or is 
measured inaccurately by the Census), then the $N_j$’s must 
be modeled also, which requires additional work. Of 
course, such additional work would be required to rake on 
these variables as well. Since all of the methods, including 
raking and regression methods, assume ignorable models, 
they will yield incorrect inferences when unmeasured 
variables affect nonresponse and are correlated with the 
outcome of interest. 

The methods described here are intended as an impro- 
vement upon raking-type poststratification adjustments and 
are not intended to, by themselves, correct for nonignorable 
nonresponse. However, by allowing one to adjust for more 
variables, the Bayesian poststratification should allow the 
use of models for which the ignorability assumption is more 
reasonable. Having a large number of poststratification 
categories (e.g., in 48 states) creates problems with classical 
weighting methods because many categories will have few 
or even no respondents. Interestingly, however, having 
many categories can make Bayesian modeling more 
reliable: more categories means more random effects in the 
regression, which can make it easier to estimate variance 
components. 
APPENDIX: COMPUTATION 


We use an EM-type algorithm to estimate the
hyperparameters $\tau_l$; given these, we sample from the posterior
distribution of the coefficients $\beta$ using a normal
approximation to the logistic regression likelihood. We use this
approximation for its simplicity and because it is reasonable
for fairly large surveys, as in our application;
if desired, more exact computations can be
performed using the Gibbs sampler and Metropolis
algorithm (see Clayton 1996), perhaps using the algorithm
described here as a starting point.

When the data distribution is normal and the means are
linear in the regression coefficients, the EM algorithm can
be used to obtain estimates of the variance components
(Dempster, Laird, and Rubin 1977), treating the vector of
coefficients $\beta$ as “missing data.” In this framework, the
“complete-data” loglikelihood for $\tau_l$ is

$$L(\tau_l \mid \gamma_l) = \mathrm{const} - K_l \log \tau_l - \frac{1}{2\tau_l^2} \sum_{k=1}^{K_l} \gamma_{kl}^2,$$

so the sufficient statistic for $\tau_l$ is
$t(\gamma_l) = \sum_{k=1}^{K_l} \gamma_{kl}^2$. Given the
current estimate $\tau^{\mathrm{old}}$, the expected sufficient statistic is

$$E(t(\gamma_l) \mid y, \tau^{\mathrm{old}}) = \| E(\gamma_l \mid y, \tau^{\mathrm{old}}) \|^2 + \mathrm{trace}(\mathrm{var}(\gamma_l \mid y, \tau^{\mathrm{old}})).$$

Since these two terms are not analytically tractable for our
model, we use the following approximations which are
easily obtained: (1) approximate $E(\gamma_l \mid y, \tau^{\mathrm{old}})$ with an
estimate $\hat{\gamma}_l$, based on $y$ and the estimate $\tau^{\mathrm{old}}$, and (2)
approximate $\mathrm{var}(\gamma_l \mid y, \tau^{\mathrm{old}})$ from the curvature of the
log-likelihood at the estimate, $V_{\gamma_l} = (-L''(\hat{\gamma}_l))^{-1}$. We update
these approximations iteratively for all $l = 1, \ldots, L$
simultaneously, converging to an approximate maximum likelihood
estimate $(\hat{\tau}_1, \ldots, \hat{\tau}_L)$. Given an initial guess $\tau^{\mathrm{old}}$, the
algorithm proceeds by iterating the following two steps to
convergence.

Approximate E-step. Solve the likelihood equations
iteratively, as described below. Use the estimate $\hat{\beta}$ to obtain
an approximation to $E(t(\gamma_l) \mid y, \tau^{\mathrm{old}})$, for each $l = 1, \ldots, L$.

We solve the likelihood equations $\frac{d}{d\beta} L(\beta \mid y, \tau) = 0$
using iteratively weighted least squares, involving a normal
approximation to the likelihood $p(y \mid \beta) = \prod_i p(y_i \mid \beta)$,
based on locally approximating the logistic regression
model by a linear regression model (see Gelman et al. 1995,
p. 391). Let $\eta_i = (Z\beta)_i$ be the linear predictor for the i-th
observation. Starting with the current guess of $\hat{\beta}$, let
$\hat{\eta} = Z\hat{\beta}$. Then a Taylor series expansion to $L(y_i \mid \eta_i)$ gives
$z_i \approx N(\eta_i, \sigma_i^2)$, where

$$z_i = \hat{\eta}_i + \frac{(1 + \exp(\hat{\eta}_i))^2}{\exp(\hat{\eta}_i)} \left( y_i - \frac{\exp(\hat{\eta}_i)}{1 + \exp(\hat{\eta}_i)} \right),$$

$$\sigma_i^2 = \frac{(1 + \exp(\hat{\eta}_i))^2}{\exp(\hat{\eta}_i)}.$$

Let $\hat{z}$ denote the value of $z$ based on plugging in the
current estimate $\hat{\eta}$, and let $\Sigma_z = \mathrm{diag}(\sigma_i^2)$. Then we obtain
an updated estimate and variance matrix using weighted
least squares based on the normal prior distribution and the
normal approximation to the logistic regression likelihood:

$$\hat{\beta} = (Z^T \Sigma_z^{-1} Z + \Sigma_\beta^{-1})^{-1} Z^T \Sigma_z^{-1} \hat{z} \qquad (2)$$

$$V_\beta = (Z^T \Sigma_z^{-1} Z + \Sigma_\beta^{-1})^{-1} \qquad (3)$$

where $\Sigma_\beta^{-1}$ denotes the prior precision matrix of $\beta$.
We iterate until convergence and then use $\hat{\beta}$ and the
appropriate elements of $V_\beta$ to estimate $\mathrm{var}(\gamma_l \mid y, \tau^{\mathrm{old}})$.
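
The iteration behind equations (2) and (3) can be sketched in Python with NumPy. This is an illustrative reading of the steps above, not the authors' code: the function name `irls_normal_prior`, its arguments, the starting value and the stopping rule are all assumptions, and the prior precision matrix is simply passed in (with zeros for coefficients given a flat prior).

```python
import numpy as np


def irls_normal_prior(Z, y, prior_precision, n_iter=50, tol=1e-8):
    """Approximate posterior mode and variance for logistic regression.

    Z               : (n, p) design matrix
    y               : (n,) array of 0/1 responses
    prior_precision : (p, p) prior precision matrix of beta
    """
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        eta = Z @ beta
        prob = 1.0 / (1.0 + np.exp(-eta))
        # sigma_i^2 = (1 + e^eta)^2 / e^eta, written as 1 / (p (1 - p))
        sigma2 = 1.0 / (prob * (1.0 - prob))
        z = eta + sigma2 * (y - prob)          # working response z_i
        w = 1.0 / sigma2                       # diagonal of Sigma_z^{-1}
        # Equation (3): variance matrix at the current linearization point
        V = np.linalg.inv(Z.T @ (w[:, None] * Z) + prior_precision)
        # Equation (2): weighted least squares update
        beta_new = V @ (Z.T @ (w * z))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    # V was evaluated at the previous iterate, which coincides with the
    # final one (up to tol) once the iteration has converged.
    return beta, V
```

At a fixed point of this update the penalized score equation $Z^T(y - p) = \Sigma_\beta^{-1}\beta$ holds, which is a convenient check on any implementation.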

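M-step. Maximize over the parameters $\tau_l$ to obtain
$\tau_l^{\mathrm{new}} = \left( \frac{1}{K_l} E(t(\gamma_l) \mid y, \tau^{\mathrm{old}}) \right)^{1/2}$, for each $l = 1, \ldots, L$. Set $\tau^{\mathrm{old}}$
to $\tau^{\mathrm{new}}$ and return to the approximate E-step.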
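
Under the complete-data loglikelihood given earlier, the M-step has a closed form: setting the derivative with respect to $\tau_l$ to zero gives $\tau_l^2 = E(t(\gamma_l))/K_l$. A minimal sketch for a single batch of coefficients follows; the function name and the inputs are hypothetical.

```python
import math


def m_step(gamma_hat, var_diag):
    """Update tau for one batch of K exchangeable coefficients.

    gamma_hat : list of K approximate posterior means of the coefficients
    var_diag  : list of the K matching diagonal elements of the variance
                matrix (their sum is the trace term)
    """
    K = len(gamma_hat)
    expected_t = sum(g * g for g in gamma_hat) + sum(var_diag)
    return math.sqrt(expected_t / K)


tau_new = m_step([0.1, -0.2, 0.05, 0.15], [0.01, 0.02, 0.01, 0.02])
print(tau_new)
```

The trace term keeps the update from collapsing toward zero when the coefficient estimates are themselves heavily shrunk.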

Once the approximate EM algorithm has converged to an
estimate $\hat{\tau}$, we draw $\beta$ from a normal approximation to the
conditional posterior distribution $p(\beta \mid y, \hat{\tau})$, using the
values from equations (2) and (3) at the last EM step as the
mean and variance matrix in the normal approximation. For
each draw of the vector parameter $\beta$, we compute the
category means, $\pi = \mathrm{logit}^{-1}(X\beta)$, and any population totals
of interest, counting each category $j$ as $N_j$ units in the
population.
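
The simulation step can be sketched as follows, assuming NumPy is available. The function name `poststratified_draws`, the two-category design matrix and the numerical values are hypothetical; only the sequence of operations (normal draw, inverse logit, population-weighted sum) follows the description above.

```python
import numpy as np


def poststratified_draws(beta_hat, V_beta, X, N, n_draws=1000, seed=0):
    """Simulate the population mean from a normal approximation for beta.

    beta_hat : (p,) approximate posterior mean of beta
    V_beta   : (p, p) approximate posterior variance matrix
    X        : (J, p) predictors for the J poststratification categories
    N        : (J,) known population counts of the categories
    """
    rng = np.random.default_rng(seed)
    betas = rng.multivariate_normal(beta_hat, V_beta, size=n_draws)
    pi = 1.0 / (1.0 + np.exp(-(betas @ X.T)))   # category means per draw
    return pi @ N / N.sum()                     # one population mean per draw


X = np.array([[1.0, 0.0], [1.0, 1.0]])
N = np.array([600.0, 400.0])
draws = poststratified_draws(np.array([0.2, -0.4]), 0.01 * np.eye(2), X, N)
print(np.percentile(draws, [25, 50, 75]))       # 50% interval and median
```

Restricting `X` and `N` to the categories of a single state gives the corresponding state-level interval.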
Estimating the Population and Characteristics of Health Facilities 
and Client Populations Using a Linked Multi-Stage 
Sample Survey Design 


K.K. SINGH, A.O. TSUI, C.M. SUCHINDRAN and G. NARAYANA' 


ABSTRACT 


This paper demonstrates the utility of a multi-stage sample survey design that obtains a total count of health facilities and 
of the potential client population in an area. The design has been used for a state-level survey conducted in mid-1995 in 
Uttar Pradesh, India. The design involves a multi-stage, areal cluster sample, wherein the primary sampling unit is either 
an urban block or rural village. All health service delivery points, either self-standing facilities or distribution agents, in 
or formally assigned to the primary sampling unit are mapped, listed, and selected. A systematic sample of households is 
selected, and all resident females meeting predetermined eligibility criteria are interviewed. Sample weights for facilities 
and individuals are applied. For facilities, the weights are adjusted for multiplicity of secondary sampling units served by 
selected facilities. For individuals, the weights are adjusted for survey response levels. The survey estimate of the total 
number of government facilities compares well against the total published counts. Similarly the female client population 
estimated in the survey compares well with the total enumerated in the 1991 census. 


KEY WORDS: Sample survey; Program evaluation; Health services; Developing country. 


1. INTRODUCTION 


The evaluation of the impact of health programs on 
population-level health outcomes often requires knowledge 
of the number and characteristics of facilities and potential 
clients. Such information is frequently lacking in develop- 
ing countries where program record keeping and vital 
registration systems tend to be incomplete and poorly 
maintained. 

To obtain current information on health status, health 
service use, service performance, and client needs, pro- 
grams have resorted to occasional sample surveys, often 
designed and conducted independently and subareally 
(Aday 1991; Ross and McNamara 1983). Some demogra- 
phic and health surveys (Macro International 1996), 
however, do provide a national profile of population-level 
health outcomes, such as fertility, child mortality, and 
nutritional well-being. The distinct advantage of a national 
population sample for planning health programs is its ability 
to measure the attitudes and behaviors of clients as well as 
non-clients. Program service statistics are limited to actual 
clients and may not yield the most current or accurate 
picture of service use. 

In addition to client behaviours, it is useful to monitor 
the accessibility and quality of services, but this requires a 
separate review of service provision at health facilities or 
related outlets. Efforts in developing countries, like the 
situation analysis studies (Miller, Ndhlovu, Gachara and 
Fisher 1991), involve probability surveys of health facilities 


and can provide a national overview of program perfor- 
mance. However, often they are restricted to reviewing 
public health programs because of incomplete registration 
of private health providers, such as private clinics or 
pharmacies. The lack of complete and accurate registration 
of private-sector service providers prevents probability 
sample surveys from being used to monitor health care 
patterns through this sector. 

Constraints on available resources to expand and improve 
the delivery of health care in developing, as well as developed, 
countries are increasing. This suggests that a more efficient 
use of resources available for monitoring and evaluation, 
particularly through surveys, is a consideration for all 
concerned. Innovative approaches to sample surveys should 
be developed to provide health planners and managers with a
maximum of information at a minimum of precision loss. 

We present results from a multi-stage, cluster sample 
survey designed to estimate the population and charac- 
teristics of health facilities and target client populations. 
The cluster sample for the survey, conducted in the large 
northern Indian state of Uttar Pradesh, is used as a basis for 
selecting health facilities and households, with subsequent 
selection of service staff from the facilities and of married 
women of childbearing age from the households. The 
survey was designed to generate independent samples of 
health facilities, staff, households, and client populations 
for the health services. 

The next section of this paper will describe the survey 
design, its contents, and fieldwork procedures as applied in 
Uttar Pradesh. The following section presents the
comparative results on health facilities and population, and the 
last section will discuss lessons learned for survey design 
from the Uttar Pradesh application. These lessons will be 
important specifically for this survey’s planned replication 
in two years but generally informative for other countries 
that may adopt the linked design. 


2. THE PERFORM SURVEY IN 
UTTAR PRADESH 


The PERFORM (Project Evaluation Review For 
Organizational Resource Management) Survey was de- 
signed to measure benchmark indicators for a large family 
planning project called the Innovations in Family Planning 
Services (IFPS) project sited in Uttar Pradesh and co- 
funded by the Government of India and the U.S. Agency for 
International Development. Uttar Pradesh has a population 
of over 140 million and by itself would rank as the fifth 
largest developing country. 


2.1 Content 


Indicator estimates for IFPS are needed at three levels: 
(1) public and private service delivery points (SDPs), 
(2) service providers staffing the SDPs or facilities, and 
(3) client population, represented by women of reproductive 
age. As IFPS seeks to improve the family planning service 
environment, it is imperative to obtain measures of 
indicators at this level but in such a way as to be relatable to 
the women resident in those environments. 

As a result, the PERFORM survey developed seven 
questionnaires: 


1-2) An urban block and village questionnaire to inventory 
all potential and actual providers of health services in 
the sampled village or urban block; 

3) A fixed service delivery point (FSDP) questionnaire 
to gather information on the staff, services, 
equipment, supplies, and education and motivation 
activities at sampled public and private facilities; 

4) A staff questionnaire administered to all FSDP staff 
involved in family planning services (identified from 
the FSDP questionnaire) to assess their capabilities 
and service experiences; 

5) An individual service agent (ISA) questionnaire to all 
individuals working outside of self-standing facilities 
(FSDPs) who currently or potentially can provide 
health planning services, such as private doctors, 
pharmacists, midwives, lay health workers, and 
retailers; 

6) A household questionnaire to be administered to 
heads of the sampled households to enumerate 
household members and selected demographic and 
social characteristics; 

7) An individual questionnaire for currently married 
women between the ages of 13 and 49 (identified from 
the household questionnaire) to collect information on 
knowledge of and past, current, and intended use of 
health services, recent pregnancy and contraceptive 
behaviors, and additional background characteristics. 


2.2 Sampling Design 


PERFORM was designed to provide estimates of facility 
and population characteristics at the state, regional, 
divisional, and district levels. The district was important 
since it was the focal point for introducing innovative 
approaches and additional IFPS inputs. At the time of the 
survey design, Uttar Pradesh had 14 administrative 
divisions; two districts were selected from each using 
probability proportional to size (PPS) procedures. These 
areal units have administrative-political boundaries and thus 
public administration utility. The districts were also 
aggregated into five regional groupings. 

In each district, the total number of households to be 
sampled was fixed at 1,500. A sample of 1,500 households 
per district was determined to be sufficient to provide 
estimates for the main population level indicators. An 
overall target sample size of 1,627 ever-married women 
aged 13-49 was required to detect a change of 5 percentage
points in contraceptive prevalence (with $\alpha = 0.05$ and
$1 - \beta = 0.90$) at district level. It was expected that the number
of ever-married women aged 13-49 per household would be
1.15 and therefore, by visiting a sample of 1,415 households,
the required number of ever-married women would be 
obtained. Allowing for an increase of 5 per cent to 
accommodate non-response and non-availability, a target 
sample of 1,725 ever-married women aged 13-49 from the 
1,500 households was considered to be sufficient. The 
schematic diagram of the sample design is given in Figure 1. 
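
The arithmetic in this paragraph can be checked directly. The variable names are illustrative; the figures are those quoted above.

```python
import math

women_required = 1627        # ever-married women aged 13-49 per district
women_per_household = 1.15   # expected number per household

# Households needed to reach the required number of women.
households_needed = math.ceil(women_required / women_per_household)

# Allowing 5 per cent for non-response and non-availability.
with_allowance = households_needed * 1.05

# The district sample was fixed at 1,500 households.
households_fixed = 1500
expected_women = round(households_fixed * women_per_household)

print(households_needed, with_allowance, expected_women)
```

The allowance brings the requirement to about 1,486 households, which the fixed figure of 1,500 covers, and 1,500 households at 1.15 women each yields the target of 1,725 women.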

The districts were further stratified into rural and urban 
areas. According to the Census of India, all places with a 
municipality, a municipal corporation, a cantonment board, 
a notified area committee, or all other places with a mini- 
mum of 5,000 population, with at least 75 percent of the 
male working population engaged in non-agricultural 
pursuits and a population density of at least 400 persons per 
square kilometer, are classified as urban areas. Urban 
blocks and rural villages served as the secondary sampling 
units (SSUs). The 1,500 households to be sampled from 
each district were allocated to the rural and urban areas in 
proportion to the size of population within the district. 
However, if the allocated proportion of urban population 
was less than 20 percent, the allocation of households in the 
urban area was fixed at 20 percent. This allocation was 
prescribed to ensure coverage of a sufficient number of 
health delivery points. 
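
A minimal sketch of the allocation rule just described, with proportional allocation and a 20 percent floor on the urban share. The function name and the example shares are hypothetical.

```python
def allocate_households(urban_share, total=1500, urban_floor=0.20):
    """Split a district's household sample between urban and rural areas.

    urban_share : proportion of the district population living in urban areas
    Returns (urban households, rural households).
    """
    share = max(urban_share, urban_floor)
    urban = round(total * share)
    return urban, total - urban


print(allocate_households(0.12))   # floor applies
print(allocate_households(0.35))   # proportional allocation
```

Districts where the floor applies are the ones whose urban blocks are over-sampled, which is why a compensating weight adjustment is needed later.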

Households within rural areas were selected using a 
stratified two-stage sampling plan. The villages in the rural 
areas were first stratified into four strata depending on the 
size of the population as follows: 


Stratum    Population size of the village
I          100 - 499
II         500 - 1,999
III        2,000 - 4,999
IV         5,000 and above.

Figure 1. Schematic Diagram of PERFORM Sample Design

Villages with fewer than 100 residents or 20 households 
were excluded from the list (such villages were rare in the 
present study). The number of villages to be selected from 
each district was allocated proportionally to each of the four 
strata. Villages were selected by first arranging them within 
the stratum by the female literacy rates and then selecting 
the required number of villages by a PPS sampling 
procedure. All households in the selected villages were 
listed and mapped, and a target number of 20 households 
was drawn from each selected village using systematic 
sampling. Villages with more than 500 households or with 
a population size of 2,500 or more (some in stratum III and 
all in stratum IV) were segmented into four parts, and two 
segments were selected for household listing and selection. 
The required 20 households were selected taking ten 
households from each segment using systematic random 
sampling. 
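
The two selection steps can be sketched as follows. The household step follows the systematic sampling described above. The village step uses systematic PPS selection from the ordered list, which is one common way to implement PPS and is an assumption here, since the exact PPS procedure is not stated. Function names and example data are hypothetical.

```python
import random


def systematic_sample(listed, n_select, seed=None):
    """Systematic sample of n_select items from a listed frame."""
    step = len(listed) / n_select
    start = random.Random(seed).random() * step
    last = len(listed) - 1
    return [listed[min(int(start + i * step), last)] for i in range(n_select)]


def pps_systematic(units, n_select, seed=None):
    """Systematic PPS selection from an ordered list of (label, size) pairs.

    The list should already be sorted by the ordering variable (here,
    female literacy rate). A unit larger than the sampling interval can
    be hit more than once.
    """
    total = sum(size for _, size in units)
    step = total / n_select
    start = random.Random(seed).random() * step
    targets = [start + i * step for i in range(n_select)]
    chosen, cumulative, t = [], 0.0, 0
    for label, size in units:
        cumulative += size
        while t < n_select and targets[t] < cumulative:
            chosen.append(label)
            t += 1
    return chosen


villages = [("A", 100), ("B", 300), ("C", 200), ("D", 400)]  # sorted by literacy
print(pps_systematic(villages, 2, seed=1))
print(systematic_sample(list(range(1, 201)), 20, seed=1))
```

Ordering the list before systematic selection spreads the sample across the range of the ordering variable, which acts as an implicit stratification.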

Households in urban areas were also selected using a 
stratified two-stage sampling plan. The towns in the urban 
areas of a district were stratified into two strata according 
to population size as follows: 


Stratum    Population size of the town
I          100,000 and more
II         Fewer than 100,000.


All towns within stratum I were selected with certainty. 
Towns in stratum II were arranged according to population 
size and the required number of towns were selected by 
PPS. From each sampled town a minimum of two blocks 
were selected using PPS methods. All households in the 
selected blocks were listed and mapped, and 15 households 
were selected from each urban block using systematic 
random sampling. 
2.2.1 District Selection Probability 


Let $m_k$ denote the population of the k-th district within
a division. Because two districts must be selected from
each division, the probability of selecting the k-th district
from a division, $r_k$, is obtained as

$$r_k = 2 \times \frac{m_k}{M},$$

where $M$ is the total population of the division
($M = \sum_{k=1}^{f} m_k$) and $f$ is the total number of districts in the
division.
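
A short sketch of this calculation for one division. The district names and populations are hypothetical.

```python
def district_probabilities(populations, n_select=2):
    """PPS selection probabilities for the districts of one division.

    populations : dict mapping district name to population
    """
    total = sum(populations.values())
    return {name: n_select * pop / total for name, pop in populations.items()}


division = {"D1": 2_000_000, "D2": 3_000_000, "D3": 5_000_000}
r = district_probabilities(division)
print(r)
```

The probabilities sum to the number of districts selected. A district holding more than half of the division's population would receive a value above 1 under this formula and would in practice have to be taken with certainty.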


2.2.2 Village and Household Selection Probability 


Let $n_{ijk}$ denote the number of households in the i-th
village, j-th stratum and k-th district. Then, $p_{ijk}$, the
probability of selecting village i from the j-th stratum and
k-th district is obtained as,

$$p_{ijk} = a_{jk} \times \frac{n_{ijk}}{N_{jk}} \times r_k,$$

where $a_{jk}$ and $N_{jk}$ are, respectively, the number of villages
selected and the total number of households in the j-th
stratum and k-th district.

Let $q_{ijk}$ be the probability of selecting a household from
the rural areas of a selected district. Then $q_{ijk}$ may be given
as

$$q_{ijk} = p_{ijk} \times \frac{20}{n_{ijk}},$$

where 20 is the number of households drawn from the
selected village.

The weights for villages and households are then the
inverse of their selection probabilities, i.e., $1/p_{ijk}$ and $1/q_{ijk}$,
and are denoted as $VW_{ijk}$ and $HW_{ijk}$, respectively.
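
The rural selection probabilities and weights can be computed as in the following sketch. The function name and the example inputs are hypothetical; the formulas are those given above.

```python
def rural_weights(a_jk, n_ijk, N_jk, r_k, households_drawn=20):
    """Village and household selection probabilities and weights.

    a_jk  : number of villages selected in stratum j of district k
    n_ijk : number of households in village i
    N_jk  : total number of households in stratum j of district k
    r_k   : selection probability of district k
    """
    p_ijk = a_jk * n_ijk / N_jk * r_k            # village probability
    q_ijk = p_ijk * households_drawn / n_ijk     # household probability
    return {"p": p_ijk, "q": q_ijk, "VW": 1.0 / p_ijk, "HW": 1.0 / q_ijk}


w = rural_weights(a_jk=10, n_ijk=250, N_jk=50_000, r_k=0.4)
print(w)
```

Because the village size cancels in the household probability, households within the same stratum and district share the same base weight before any non-response adjustment.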


2.2.3 Town, Urban Block and Household Selection 
Probability 


The probability of selecting the j-th town from the k-th
district, $t_{jk}$, is obtained as

$$t_{jk} = \begin{cases} 1 & \text{if the population of the town is} \ge 100{,}000 \\ c_k \times \dfrac{s_{jk}}{S_k} & \text{if the population of the town is} < 100{,}000 \end{cases}$$

where $s_{jk}$ is the total number of households in the j-th town
(with a population < 100,000) in the k-th district, $c_k$ is the
number of towns selected in district k, and $S_k$ is the total
number of households in towns with less than 100,000
population in district k.

Let $u_{ijk}$ denote the probability of selecting the i-th urban
block from the j-th town and k-th district. Then $u_{ijk}$ is
obtained as

$$u_{ijk} = b_{jk} \times \frac{x_{ijk}}{y_{jk}} \times t_{jk} \times r_k,$$

where $b_{jk}$ is the number of urban blocks selected and $y_{jk}$ is
the total number of households in the j-th town and k-th
district, and $x_{ijk}$ is the number of households in the i-th
block, j-th town and k-th district.

The probability of selecting a household from the i-th
urban block and the k-th district, denoted as $v_{ijk}$, is given as,

$$v_{ijk} = u_{ijk} \times \frac{15}{x_{ijk}},$$

where 15 is the number of households drawn from the
selected urban block.

The weights for urban blocks and households are then
the inverse of their selection probabilities, i.e., $1/u_{ijk}$ and
$1/v_{ijk}$, and are denoted as $UW_{ijk}$ and $HW_{ijk}$, respectively.
Since the population-level estimates are based on
individuals, all individuals in a selected household received the
household weight. No selection procedure was used for
eligible respondents within a household.


2.2.4 Adjustment for Household Questionnaire for 
Non-response and Over-sampling of Urban 
Blocks 


The adjustment of the household weight for non- 
response is done under the assumption of random non- 
response within the village (or urban block) and is carried 
out as follows: 

Let $n_1$ be the number of households selected and $n_2$ be
the number of households where interviews are completed.
Then the adjusted weight for households due to
non-response is defined as

$$HW'_{ijk} = HW_{ijk} \times \frac{n_1}{n_2}.$$

The final household weight also includes an adjustment
of proportion of urban population in the district, where an
over-sampling of urban blocks has occurred (districts with
less than 20 percent of urban population).

Let $\pi_1$ be the actual proportion of urban population in a
district and $\pi_2$ the proportion of urban population in the
sample. Then the adjusted weight for households due to
non-response and over-sampling of urban blocks is defined
as

$$AW_{ijk} = HW'_{ijk} \times \frac{\pi_1}{\pi_2}.$$
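
A sketch of the two adjustments, assuming the urban correction is applied multiplicatively to households in over-sampled urban blocks. The text does not spell out the treatment of rural households in those districts, so that case is left out. The function and argument names are hypothetical.

```python
def adjusted_household_weight(hw, n_selected, n_completed,
                              pop_urban_share=None, sample_urban_share=None):
    """Adjust a household weight for non-response and urban over-sampling.

    hw                 : base household weight (inverse selection probability)
    n_selected         : households selected in the village or urban block
    n_completed        : households with completed interviews
    pop_urban_share    : actual urban proportion in the district (optional)
    sample_urban_share : urban proportion in the sample (optional)
    """
    weight = hw * n_selected / n_completed
    if pop_urban_share is not None and sample_urban_share is not None:
        weight *= pop_urban_share / sample_urban_share
    return weight


# Non-response only (19 of 20 households interviewed).
print(adjusted_household_weight(625.0, 20, 19))
# Urban block in a district that is 12 percent urban but sampled at 20 percent.
print(adjusted_household_weight(625.0, 20, 19, 0.12, 0.20))
```

The non-response factor rests on the stated assumption that non-response is random within the village or urban block.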


2.2.5 Selection of Service Delivery Points in Sample 
Districts 


To obtain a probability sample of service delivery points, 
FSDPs and ISAs were selected in relation to the SSUs, i.e., 
the villages or urban blocks, as follows: 


1) All private and public sector health institutions in 
selected rural and urban SSUs; 

2) All sub-centres, primary health centres, community 
health centres, post-partum centers providing services 
to the population in the selected rural SSUs; 


3) All private hospitals with 10 or more beds in the 
nearest town (with fewer than 100,000 population) 
within 30 kms of selected rural SSUs; 

4) All municipal hospitals, district hospitals, and medical 
college hospitals; 

5) All clinics and hospitals run by voluntary agencies, the 
organized sector, and cooperatives; and 

6) All ISAs in selected villages and urban blocks. 


It is probably helpful first to describe the organized 
delivery of health care through the government sector. 
Residents of all villages are entitled to obtain health care 
from a government sub-centre (SC), a primary health centre 
(PHC), and a community health centre (CHC). Villages 
with 5,500 population or more often have an SC located 
within their boundaries. Approximately six SCs will report 
to one PHC, and PHCs in turn are linked to a CHC. At 
times the PHC is integrated with the CHC; as a result, our 
estimation must be of CHCs and PHCs combined, while 
SCs are estimated separately. (Population growth has led to 
the establishment of “additional PHCs” and redistricting of 
the original PHC catchment areas. These additional PHCs 
have been included in the estimation of the number of 
PHCs.) All SCs assigned to a sampled village were visited, 
as were their affiliated PHCs and CHCs. 

At the time of listing and mapping households in each 
urban block and village, the FSDPs and ISAs were also 
listed and mapped. In addition, key informants in each SSU 
were interviewed regarding health outlets not visibly 
obvious. The selection of service delivery points (FSDPs 
and ISAs) within the SSU boundaries, or affiliated with 
the government’s health subcentre, involved a full census. 
The one exception to this was for municipal hospitals, 
district hospitals and medical colleges, which were self- 
selected and thus had a weight of unity. The selection 
probabilities of the other FSDPs and ISAs are then a 
function of the probability of selecting the SSU, and the 
inverse of the latter serves as the weight of the FSDP or ISA 
unit. Weights for CHCs, PHCs, and SCs were calculated 
with the procedure below after determining some fieldwork 
“failure” in selecting these types of facilities correctly. 
(This failure is discussed later.) 

Since CHCs and PHCs are associated with more than
one SSU, we have assumed that one PHC exists per 30,000
population (which is approximately the actual average for
Uttar Pradesh) and that one SC serves approximately 5,500
(actual district averages range from 4,000 to 6,500). Under
this assumption, the CHC/PHC weight for each selected
SSU is then

$$W_{CHC/PHC} = \frac{\text{Total population in selected SSU}}{30{,}000} \times VW_{ijk} \ (\text{or } UW_{ijk}),$$

and the SC weight for each selected SSU is

$$W_{SC} = \frac{\text{Total population in selected SSU}}{5{,}500} \times VW_{ijk} \ (\text{or } UW_{ijk}).$$
All weights for FSDPs that were not self-selected had to
be adjusted for multiplicity, i.e., when an FSDP was
selected into the sample on the basis of more than one SSU.
For example, a CHC/PHC might be selected because of two
sampled SSUs. In this case, the weight for the CHC/PHC
was the sum of the weights, $W_{CHC/PHC}$, of the two
selected SSUs associated with its selection.
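
The facility weights and the multiplicity adjustment can be sketched as below. The constants are the catchment sizes assumed in the text; the function names and the example SSUs are hypothetical.

```python
PHC_CATCHMENT = 30_000   # assumed population served by one CHC/PHC
SC_CATCHMENT = 5_500     # assumed population served by one SC


def facility_ssu_weight(ssu_population, ssu_weight, catchment):
    """Weight contributed to a facility by one selected SSU."""
    return ssu_population / catchment * ssu_weight


def facility_weight(ssus, catchment):
    """Sum the contributions of every sampled SSU that led to the facility.

    ssus : list of (SSU population, SSU weight) pairs
    """
    return sum(facility_ssu_weight(pop, w, catchment) for pop, w in ssus)


# A CHC/PHC reached through two sampled villages.
print(facility_weight([(1_500, 50.0), (3_000, 40.0)], PHC_CATCHMENT))
```

A facility reached through a single SSU is the special case of a one-element list.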


2.3 Survey Implementation 


Fieldwork for the PERFORM Survey was conducted 
from June to September 1995 in Uttar Pradesh. The survey 
was executed by four organizations contracted following a 
competitive procurement process. One organization that 
had tested the PERFORM survey design in one district a 
year earlier served as the nodal or coordinating organi- 
zation. Master training to survey project coordinators and 
supervisors was provided, including a field pretest. The 
actual fieldwork for PERFORM was carried out in six- 
member teams composed of 1 male supervisor, 1 female 
editor, 1 male interviewer and 4 female interviewers. Each 
fieldwork organization on average engaged 3 teams to 
cover one district, or a total of 18 field staff for data 
collection per district (or 21 teams for a total of 126 field 
staff to cover 7 districts). Overall field supervision was the 
responsibility of a specially-appointed four-member team, 
one assigned to each consulting fieldwork organization. 
Following field editing, the questionnaires were transported 
to the home offices of the survey organizations for data 
entry and cleaning. One type of staff person, the auxiliary 
nurse-midwife who is stationed at a subcentre, was difficult 
to reach, even after the standard three attempts. 


3. RESULTS 


Table 1 gives the sample coverage for the PERFORM 
survey, in terms of the number of units selected of each 
type, the number successfully interviewed, and the 
completion rate. The completion rates are very high for 
sample units requiring personal contact, ranging from 94.3 
percent for eligible women to 96.7 percent for households. 
Interview completion rates were 95 percent for facilities and 
agents. Only for fixed facility staff was the rate somewhat 
lower at 90 percent, a respectable although not an 
outstanding level. (One type of staff person, the auxiliary 
nurse-midwife who is stationed at a subcentre, was difficult 
to reach, even after the standard three attempts.) 
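
The completion rates can be recomputed directly from the counts reported in Table 1. The sketch below does this; the dictionary keys and the function name are ours, and the counts are those printed in the table.

```python
# Counts copied from Table 1 (units sampled and units interviewed).
sampled = {
    "Villages": 1539, "Urban blocks": 738, "Households": 42006,
    "Eligible women": 48009, "Fixed SDPs": 2549, "FSDP staff": 7026,
    "Individual agents": 23364,
}
interviewed = {
    "Villages": 1539, "Urban blocks": 738, "Households": 40633,
    "Eligible women": 45277, "Fixed SDPs": 2428, "FSDP staff": 6320,
    "Individual agents": 22335,
}

def completion_rate(n_interviewed, n_sampled):
    """Completion rate in percent."""
    return 100.0 * n_interviewed / n_sampled

# One rate per unit type, rounded to one decimal place.
rates = {unit: round(completion_rate(interviewed[unit], n), 1)
         for unit, n in sampled.items()}

for unit, rate in rates.items():
    print(f"{unit:<18} {rate:5.1f}")
```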


3.1 Population Size and Characteristics 


We first compare population-level measures on selected 
demographic indicators obtained from other sources with 
those from the PERFORM survey, as shown in Table 2. 
The figures indicate that PERFORM results compare 
favorably with census measures as well as those from the 
recent National Family Health Survey (NFHS) conducted 
in Uttar Pradesh in late 1992 and early 1993, with a sample 
size of 11,438 ever-married women aged 13 to 49. The 
enumerated population shows a growth of almost 10.5 
million persons since the 1991 census, and the percentage 
of households in urban areas is close across all three 
sources. The ratio of women to men is slightly lower in 
PERFORM (891) than in the NFHS (917). The percentage 
of the population in the two age groups (0 to 14 and 65 and 
over) compares well, as does the percentage of households 
belonging to the scheduled castes. The percentage of 
households belonging to scheduled tribes is 3.1, higher than 
the 1.1 observed in the NFHS. This may reflect an actual 
growth in such households with increased in-migration to 
large towns and cities by scheduled tribe members. The 
proportions literate show small gains since the NFHS but 
compare well overall. The total fertility rate and the level 
of modern contraceptive use also are similar and change in 
a consistent direction between the dates of the two Uttar 
Pradesh surveys. Results in Table 2 suggest that 
PERFORM’s sample design, based on traditional 
multistage cluster sample designs used for demographic 
surveys, was executed properly to produce state-level 
results comparable to the census and earlier NFHS survey. 
The standard errors and design effects of the estimates are 
also given in the table. 


Table 1 
Coverage of Sample Units of PERFORM Survey: Uttar Pradesh, 1995 


                                              Sample Units 
Sample Coverage      Villages  Urban   Households  Eligible  Fixed  FSDP   Individual 
                               Blocks              Women     SDPs   Staff  Agents 
Number Sampled       1,539     738     42,006      48,009    2,549  7,026  23,364 
Number Interviewed   1,539     738     40,633      45,277    2,428  6,320  22,335 
Percent Completed    100.0     100.0   96.7        94.3      95.3   89.9   95.6 


Notes: Villages and urban blocks served as the primary sampling units; eligibility criteria for women were currently married and 
between ages 13 to 49 years; SDP = service delivery point. 


Table 2 
Basic Demographic Indicators for Uttar Pradesh, India 

                                                 Uttar Pradesh 
Index                          Census (1991)  NFHS (1992-93)  PERFORM (1995)  Standard Error  Design Effect 

Population                     139,112,287    u               149,758,641     1,542,952       - 
Percent urban                  19.8           22.6^a          21.6^a          0.6553          12.6095 
Sex ratio^b                    879            917             891             34.1010         0.9727 
Percentage 0-14 years old      39.1           41.8            40.2            0.1306          1.9049 
Percent 65+ years old          3.8            4.8             4.7             0.0513          1.5789 
Percentage scheduled caste     21.0           18.0^a          20.0^a          0.3790          3.6536 
Percentage scheduled tribe     0.2            1.1^a           3.1^a           0.1818          4.4694 
Percent literate^c 
    Male                       55.7           65.3            67.6            0.3352          6.4634 
    Female                     25.3           31.4            37.4            0.3824          8.6821 
    Total                      41.6           49.9            S8)             033852          12.2385 
Total fertility rate           5.1            4.8             4.5             -               - 
Modern contraceptive use       u              18.5^d          22.0^d          0.3499          3.4111 

u = Unavailable 
a Based on number of households 
b Number of females per thousand males 
c Based on population aged 7 and above for the census and population aged 6 and above for NFHS and PERFORM 
d Percentage of currently married women aged 15 to 49 using modern contraceptive method. 


In Table 3 we compare the age and sex distributions for 
Uttar Pradesh obtained from the NFHS and PERFORM, as 
well as from the Sample Registration System, operated by 
the Office of the Registrar General. The sex ratios for the 
two surveys are also given. The age-sex distributions are 
again comparable across the three sources. However, there 
is a markedly lower sex ratio for the age group 30-49 years 
(820) in PERFORM and a slightly higher one for ages 
50-64 (993) than those in the NFHS (941 and 960 
respectively). We suspect some of this difference is due to 
a “push” of females out of the end of childbearing ages by 
field investigators of one survey organization to avoid 
completion of the pregnancy calendar and history portions 
of the questionnaire. (Upon further investigation, we found 
the sex ratios for women aged 50-64 to be uniformly higher 
in the seven districts under one organization’s responsibility 
than those of others.) As a result, there are somewhat more 
women aged 50-64 enumerated in the PERFORM Survey 
than may actually be the case. This also may mean that 
births to women who were actually under age 50 were 
under-enumerated. Because this is not a high-fertility age 
group, the bias is not likely to be large. 


3.2 Facility Size and Characteristics 


By visiting and interviewing the facilities selected 
through the SSUs or clusters, we are able to generate an 
independent sample of health facilities and service 
providers. (These include those who currently provide, as 
well as those who potentially can provide, family planning 
services; i.e., not all of the estimated retail outlets (general 
merchant, kirana and pan shops) shown presently dispense 
contraceptives.) The weighted counts of these outlets are 
shown in Table 4. Our ability to validate the estimates of 
independent agents is weakened by the fact that many of 
them are not registered, particularly the “unqualified” (or 
quack) doctors. Narayana, Cross and Brown (1994: 
Table 8) report a 1991 total number of 112,568 villages in 
Uttar Pradesh, which would suggest almost one traditional 
birth attendant per village and 1 anganwadi worker for 
every 4.5 villages on average. These ratios appear 
reasonable given known circumstances regarding access to 
such types of care. The figures are quite close and 
provide evidence of the utility of the linked cluster sample 
design. 


Table 3 
Percent Distribution of the De Jure Population by Age and Sex, Based on SRS, NFHS, and PERFORM Sources for 1991-95 

          SRS (1991)        NFHS (1992-93)                PERFORM (1995) 
Age       Male    Female    Male     Female   Sex Ratio   Male    Female   Sex Ratio 

0-4       14.4    14.4      14.6     14.6     917         13.8    14       909 
5-14      24.9    24.4      PUTIES:  26.0     868         PAI?    26.3     861 
15-29     28.4    26.8      PSs |    26.4     967         25.4    Dae      972 
30-49     20.7    Zed       19.2     19.7     941         19.8    18.3     820 
50-64     8.2     8.5       8.4      8.8      960         8.6     9.6      993 
65+       3.6     4.0       SZ       4.4      718         3       4.1      702 
Total     100.0   100.0     100.0    100.0                100.0   100.0 


Source for Sample Registration System (SRS): Office of the Registrar General, India (1993a) 
Source for NFHS: National Family Health Survey, Uttar Pradesh (1992-93) 


Table 4 
Total Number of Estimated Public and Private Service Delivery Points by Type in Uttar Pradesh, India: 1995 

Fixed service delivery points    Number    Individual service agents         Number 
Total                            31,400    Total                             1,099,825 
Hospitals                                  Physicians 
  Government allopathic          968         Private resident allopathic     32,182 
  Government ISM                 688         Private visiting allopathic     9,011 
  Municipal allopathic           57          Private resident (unqualified)  62,880 
  Municipal ISM                  23          Private resident ISM            42,343 
  Private                        5,212       Private visiting ISM            9,138 
  Private voluntary              130       Anganwadi workers                 25,994 
  Private ISM                    35        Village health workers            65,532 
  Industrial                     61        Traditional birth attendants      110,546 
  Medical colleges               y         Medical shops                     40,979 
CHC/PHC/Additional PHC           3,948     General merchants                 UBS) (9) 
Subcentres                       20,151    Kirana shops                      376,679 
Other                            137       Pan shops                         136,353 
                                           Depot holders                     5,818 
                                           Other                             48,855 


3.3 Estimation Approaches 


The estimated number of CHC/PHCs and SCs in Table 4 
is based on the assumption that each such facility serves a 
fixed population size, i.e., 30,000 and 5,500 respectively — 
the figures used by the government for planning health 
service delivery. The precision of the estimation would 
have been improved if the actual size of the local catchment 
population were known. In the absence of this information, 
we have used a constant population estimate for these two 
facility types. 


Alternate estimation approaches were used prior to 
arriving at the above procedure. The first is illustrated in 
Table 5, which presents the actual and weighted counts of 
CHC/PHCs and SCs in each of the 28 survey districts. 
These figures are based on weighting the selected facilities 
by the SSU size only and without adjusting for multiplicity. 
The PERFORM sample selected a total of 633 
CHC/PHCs or 34.8 percent of the total (1,818) and 1,267 
subcenters or 13.3 percent of the total (9,491) in the 28 
districts. These can be compared against the actual numbers 
of CHC/PHCs and SCs in 1995 obtained from the Uttar 
Pradesh Department of Health and Family Welfare. It is 
evident that this weighting approach substantially 
overestimates the number of CHC/PHCs (3,472 compared to 
1,818) but yields a nearly identical number of SCs (9,495 
compared to 9,491). Using the villages and urban blocks as 
SSUs is reasonable as they are the public administration 
units (and population sizes) used to determine the location 
of subcenters. 

They, however, do not offer an adequate stratification 
basis for the larger health facilities. Precision is lost because 
we weight with the inverse of the SSU’s population and, 
when CHC/PHCs are selected in very small SSUs, the 
associated weight is disproportionately inflated. This results 
in a higher-than-actual count of such facilities, a situation 
most problematic in two districts, Allahabad and Sultanpur. 
If these two districts are eliminated, the over-estimation is 
22.5 (± 0.8) percent instead of 91 percent. (Under-estimation 
of CHC/PHCs results where the reverse occurs, as in 
Bareilly district. Because of PPS, large stratum IV villages 
have small weights, and in fact most selected FSDPs in this 
district have been sampled in SSUs of this size.) 
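
The over-estimation percentages quoted above follow from the Table 5 totals. A minimal sketch of the arithmetic is given here; the function name is ours, and the totals are those printed in the table.

```python
def relative_bias_pct(estimated, actual):
    """Percentage by which the weighted estimate exceeds the actual count."""
    return 100.0 * (estimated - actual) / actual

# Table 5 totals: weighted estimate versus actual 1995 count.
chc_phc_all = relative_bias_pct(3472, 1818)       # all 28 districts
chc_phc_excl = relative_bias_pct(2004, 1636)      # without Allahabad and Sultanpur
subcentres = relative_bias_pct(9495, 9491)

print(round(chc_phc_all), round(chc_phc_excl, 1), round(subcentres, 2))
```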

A second estimation approach used was to calculate the 
expected number of CHC/PHCs and SCs based on a priori 
knowledge that such facilities were located in SSUs of 
minimum size 30,000 or 5,500, respectively. With 1991 
census information on the SSU population, we recon- 
structed the distribution of each district’s population by 
stratum size and divided each stratum by the CHC/PHC or 
SC catchment size (30,000 or 5,500 respectively). This 
provides the expected number of CHC/PHCs and SCs for 
each district. We can compare this with the observed 
number of such facilities, obtained at the time of fieldwork 
where local community informants were asked whether 
there was a CHC/PHC and/or SC located within the SSU. 
This comparison is shown in Table 6, which also includes 
a fieldwork organization code (I to IV) in the event any 
pattern of survey error is evident. This approach 
overestimates the number of subcenters by 19.6 percent and 
under-estimates the number of CHC/PHCs by 26.5 percent. 
Excluding the two districts with a high number of stratum I 
SSUs (Allahabad and Sultanpur) reduces the CHC/PHC 
underestimation to 10.2 percent. Tabulation of estimation 
bias by fieldwork organization shows no systematic bias. 
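
The second approach and the percentages quoted for it can be sketched as follows. The stratum populations are hypothetical. The Table 6 totals reproduce the quoted figures when the subcentre difference is taken relative to the observed count and the CHC/PHC difference relative to the expected count; that choice of base is our reading of the numbers, not something the text states.

```python
def expected_facilities(stratum_populations, catchment_size):
    """Expected count: each stratum population divided by the catchment size, summed."""
    return sum(pop / catchment_size for pop in stratum_populations)

def pct_difference(value, base):
    """Difference between value and base, as a percentage of base."""
    return 100.0 * (value - base) / base

# Hypothetical district with three population strata and a 30,000 catchment.
expected_chc_phc = expected_facilities([60_000, 90_000, 30_000], 30_000)

# Table 6 totals: observed ("Actual") and expected ("Estimated") counts.
sc_over = pct_difference(538, 450)                           # subcentres
chc_under = pct_difference(186, 147)                         # CHC/PHCs, all districts
chc_under_excl = pct_difference(186 - 19 - 16, 147 - 4 - 6)  # minus Allahabad, Sultanpur

print(round(sc_over, 1), round(chc_under, 1), round(chc_under_excl, 1))
```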

The results from the two weighting approaches suggest 
that the SSU offers an appropriate measure of size (MoS) 
for the selection of subcenters, since its average population 
size may approximate the SC’s catchment size of 5,500. A 
larger MoS may have served the selection of CHC/PHCs 
better, since this facility’s catchment size covers those for 
five to six subcenters. Because SSU size is the basis for the 
weight for CHC/PHCs, when the selected SSU is small, the 
bias in estimated counts can be large. A future design to 
consider is to use a cluster of SSUs that are contiguous to 
the selected SSU and have an MoS similar to the catchment 
size of CHC/PHCs. The probability of such a facility being 
present within the boundaries of the SSU cluster will then 
be higher and the weight, constructed on the basis of the 
total population in the SSU cluster, more reliable. In other 
words, our estimation is limited by not knowing how many 
SSUs are served by one CHC/PHC. 
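
One way such a cluster of contiguous SSUs might be formed is sketched below. This is only an illustration of the idea, not a procedure used in PERFORM: the breadth-first rule, the function name, the adjacency lists and the populations are all hypothetical.

```python
from collections import deque

def build_ssu_cluster(seed, population, neighbours, target=30_000):
    """Grow a cluster breadth-first from the seed SSU until its
    population reaches the target measure of size."""
    cluster, total = [seed], population[seed]
    queue, seen = deque([seed]), {seed}
    while queue and total < target:
        current = queue.popleft()
        for nb in neighbours.get(current, []):
            if nb in seen:
                continue
            seen.add(nb)
            cluster.append(nb)
            total += population[nb]
            queue.append(nb)
            if total >= target:
                break
    return cluster, total

# Hypothetical SSU populations and contiguity (adjacency) lists.
population = {"A": 4_000, "B": 9_000, "C": 12_000, "D": 8_000, "E": 15_000}
neighbours = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "E"],
              "D": ["B"], "E": ["C"]}

cluster, total = build_ssu_cluster("A", population, neighbours)
print(cluster, total)
```

The cluster population, rather than the population of the selected SSU alone, would then serve as the measure of size in the weight.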


Table 5 
Total Actual and Estimated Total Number of Community Health 
Centres, Primary Health Centres,^a and Subcentres by District 
in Uttar Pradesh, India: 1995 

                  CHC/PHC                 Sub-centre 
District          Actual    Estimated     Actual    Estimated 

Aligarh           dd        69            399       369 
Azamgarh          103       69            475       949 
Almora            44        104           254       468 
Allahabad         112       981           594       677 
Ballia            73        95            357       485 
Banda             89        101           322       302 
Bareilly          71        42            Shp)      162 
Dehradun          24        41            39        60 
Etawah            69        84            323       364 
Fatehpur          57        73            309       327 
Firozabad         33        34            234       236 
Gonda             107       183           528       461 
Gorakhpur         59        84            470       460 
Jhansi            Sil       WH]           75)       157 
Kanpur Nagar      12        13            81        74 
Maharajgang       30        89            195       180 
Meerut            76        187           410       119 
Mirzapur          64        69            309       302 
Moradabad         92        81            485       248 
Nainital          53        79            287       344 
Rampur            Si)       19            170       139 
Saharanpur        60        49            293       388 
Shahjahanpur      52        59            301       298 
Sultanpur         70        487           394       649 
Tehri Garhwal     31        5             159       63 
Unnao             63        162           344       106 
Sitapur           87        44            437       450 
Varanasi          122       144           616       658 
Total             1,818     3,472 (±21)   9,491     9,495 (±15) 
Total^b           1,636     2,004 (±13) 

a Includes additional primary health centres 
b Excludes Allahabad and Sultanpur districts 

Source for 1995 actual figures: Government of Uttar Pradesh 
Department of Medical and Family Welfare. 


Table 6 
Observed and Expected Sampled Number of CHCs/PHCs^a and Subcentres Within the 
Rural Village (Urban Block) by District in Uttar Pradesh, India: 1995 

                  CHC/PHC                Sub-Centre             Field Work 
District          Actual    Estimated    Actual    Estimated    Company 

Aligarh           6         S)           10        Ni)          II 
Azamgarh          3         5            24        15           III 
Almora            5         2            14        9            I 
Allahabad         19        4            Ly.       18           Il 
Ballia            9         7            34        27           III 
Banda             8         y            19        27           III 
Bareilly          ®)        3            10        16           II 
Dehradun          5         7            10        21           I 
Etawah            8         of           17        20           II 
Fatehpur          9         ql           22        pia)         IV 
Firozabad         6         6            28        30           II 
Gonda             8         5            15        18           IV 
Gorakhpur         5         4            16        20           IV 
Jhansi            7         6            16        24           II 
Kanpur Nagar      2         2            6         8            II 
Maharajgang       4         4            9         13           IV 
Meerut            12        8            12        34           II 
Mirzapur          7         4]           Ze        pH)          Il 
Moradabad         5         5            9         19           I 
Nainital          6         4            19        19           I 
Rampur            2         5            14        16           I 
Saharanpur        6         6            USS       21           I 
Shahjahanpur      5         3            14        15           II 
Sultanpur         16        6            21        15           IV 
Tehri Garhwal     1         3            3         10           I 
Unnao             3         6            17        il           IV 
Sitapur           10        6            9         24           IV 
Varanasi          6         5            18        18           Il 
Total             186       147          450       538 
Total^b           151       137 

a Includes additional primary health centres 
b Excludes Allahabad and Sultanpur districts. 


4. DISCUSSION 


The cluster-based sample design for generating 
independent samples of facilities and households, which 
can be analyzed individually or jointly, does warrant more 
extensive consideration in data collection efforts for health 
program research and evaluation in developing countries. 
Careful design and fieldwork sampling and execution can 
yield high-quality and acceptably precise survey estimates, 
as our results show. The weighted totals, rather than sample 
totals, themselves are numbers useful to program planners 
who decide the flow of personnel, material, and financial 
resources to and among various facility sites and area 
locations. The linkage of facility to individual records offers 
further important analytic opportunities to assess the 
relative importance of personal background and service 
supply factors on health outcomes of interest (e.g., Boyd 
and Iversen 1979). 

At the same time, our application of this design reveals 
several lessons. First, there is an obvious need to monitor 
the survey fieldwork closely with increased on-site data 
entry so that the apparent “push” of eligible women out of 
the older age ranges can be prevented. This is difficult to 
detect through individual questionnaire spot checks but can 
be observed in aggregate tabulations produced, say, weekly 
on completed questionnaires. Second, the excess count of 
CHCs/PHCs in two districts, where the survey fieldwork 
involved two different organizations, suggests that stratum 
I villages might have been disproportionately selected or 
that some of the CHCs/PHCs reported to be within the SSU 
boundaries were in fact not. The former may have occurred 
as a sampling error since each fieldwork organization was 
provided with a list of sampled SSUs. Third, the listing and 
mapping of SSUs for facilities, individual health care 
providers, and households are an important stage of the 
fieldwork. Careful execution of this task allows the sampled 
units to be re-located for future follow-up. This will be an 
essential measurement effort for evaluating the IFPS 
project. 

Certainly for a survey as complex as PERFORM, scaled 
to capture the levels of and differentials in the patterns of 
health service delivery and client use in an area as populous 
as Uttar Pradesh, the fact that the quality of the data meets 
most standards of precision evidences an important 
fieldwork achievement as well as design innovation. 


Computer-assisted Interviewing in a Decentralised Environment: 
The Case of Household Surveys at Statistics Canada 


J. DUFOUR, R. KAUSHAL and S. MICHAUD¹ 


ABSTRACT 


In 1993, Statistics Canada implemented Computer-assisted Interviewing (CAI) for conducting interviews for some 
household surveys that were conducted in a decentralised environment. The technology has been successfully used for a 
number of years, and most household surveys have now been converted to this collection mode. This paper is a summary 
of the experience and the lessons that have been learned since the research started. It describes some of the tests that led 
to the implementation of the technology, and some of the new opportunities that have arisen with its implementation. It also 
discusses some challenges that were faced when CAI was implemented (some are on-going issues), and ends with a brief 
overview of where this may lead us in the future. 


KEY WORDS: Household surveys; Data collection; Computer-assisted interviewing; Decentralised environment. 


1. INTRODUCTION 


The first systems of computer-assisted interviewing 
(CAI) were developed in the early 1970s (see Nicholls and 
Groves 1986). These systems were mainly developed by 
market research organisations in the United States and, a 
little later, independently by well-known university research 
centres. During the late 1970s and early 1980s, computer- 
assisted interviewing systems became much more sophisti- 
cated, and their use expanded greatly. By the late 1980s, a 
number of universities and survey research centres in the 
United States had a computerised collection system (see 
Lyberg, Biemer, Collins, de Leeuw, Dippo, Schwarz and 
Trewin 1997). Clark, Martin and Bates (1997) provide an 
overview of the development and implementation of such 
systems in four major government statistical agencies. 

In 1987, Statistics Canada conducted its first experiment 
with computer-assisted interviewing for household surveys. 
At that time, the tests were done in a “centralised telephone 
collection environment”. The series of tests with computer- 
assisted interviewing was extended into the early 1990s to 
try to adapt to the more general collection methodology. 

At Statistics Canada most household surveys share a 
common sampling frame and data collection environment. 
The main user of this frame is the monthly Labour Force 
Survey (LFS). Data collection is decentralised with the 
initial interview in person at the selected dwelling and the 
subsequent five interviews by telephone from the inter- 
viewer’s home. To accomplish this, almost a thousand 
interviewers have been equipped with portable computers. 
Interviewers are attached to one of the five regional offices 
located throughout Canada. A number of household surveys 
in the bureau follow a similar collection strategy by 
subsampling from the Labour Force Survey sample, by 
administering a series of supplementary questions after the 
Labour Force Survey interview or by contacting persons 
who have formerly participated in the survey. As a result, 
not only is the Labour Force Survey sample shared with 
other surveys, but so is the collection infrastructure. All 
interviewers are required to work on the Labour Force 
Survey for a specified week each month, and for the rest of 
the time, they have been trained and equipped to collect 
data for other surveys. For further details on the Labour 
Force Survey methodology, see Statistics Canada (1998). 

The 1990s saw testing of the implementation of the 
computer-assisted collection mode not only for the LFS but 
also for other surveys sharing that common infrastructure 
and having very different requirements. The results of the 
various tests led to the implementation of computer-assisted 
interviewing for the LFS in November 1993 (Dufour, 
Kaushal, Clark and Bench 1995) while its supplementary 
monthly surveys have been changed gradually. In January 
1994, a new longitudinal survey, the Survey of Labour and 
Income Dynamics (SLID) was launched using computer- 
assisted interviewing (see Lavigne and Michaud 1995). 
Since then, the National Population Health Survey (NPHS) 
along with the National Longitudinal Survey of Children 
and Youth (NLSCY), introduced in August and November 
1994 respectively, have also adopted this collection mode 
(see Tambay and Catlin 1995, Brodeur, Montigny and 
Bérard 1995). For further details on the structure and 
implementation of this computerised collection mode in 
longitudinal surveys, see Brown, Hale and Michaud (1997). 
Today most of Statistics Canada’s household surveys are 
collected using a computerised mode and a common 
infrastructure. 

This article focuses primarily on methodology aspects of 
decentralised computer-assisted interviewing for household 
surveys. We provide an overview of the implementation 
process for the statistical agency as a whole, a brief 
discussion of the challenges associated with the new 
collection vehicle and a list of references for more detailed 
information on specific topics. Despite “growing pains”, 
Statistics Canada is continuing to experiment with and 
implement this new technology in various surveys to render 
these surveys more cost efficient and to improve data 
quality and the survey monitoring process. 

The article is divided into five sections. In the next 
section, aspects of implementation are discussed with 
reference to several surveys. Section 3 details new 
opportunities arising from computer-assisted interviewing. 
The ongoing challenges and new problems that surveys face 
as a result of using a decentralised computerised collection 
mode, as well as the changes that are taking place, are 
discussed in Section 4. The last section describes the future 
of CAI for household surveys at Statistics Canada. 


2. FIRST YEARS OF IMPLEMENTATION 


Adopting a computerised collection method for house- 
hold surveys held the promise of several benefits: (i) a 
decrease in survey costs, (ii) better data quality, (iii) the 
possibility of using more complex questionnaires, (iv) data 
made available more quickly, (v) a tool for tracing 
operations, (vi) the possibility of using dependent 
interviews, and (vii) a generalised collection method for all 
of the agency’s household surveys. However, these benefits 
were not realised overnight, or without effort. Ongoing 
evaluations and adjustments were required in the 
introduction and stabilisation phases. 

Despite a number of tests being conducted before the 
implementation of CAI, unforeseeable problems occurred 
with the adoption of this method, but over time, they 
became less frequent and easier to solve. In addition, during 
this period, the series of quality indicators analysed 
carefully by different groups of Statistics Canada experts 
were somewhat disrupted. It took about one year to realise 
the anticipated benefits. This section describes the main 
points in the process of changing from the traditional paper 
approach to computer-assisted interviewing, where 
collection and capture are integrated. 


2.1 Centralised Computer-assisted Telephone 
Interviewing 


The traditional approach to interviewing used a paper 
questionnaire filled out in pencil to facilitate edits made by 
the interviewer. Often such an approach is referred to as 
Paper and Pencil Interviewing (PAPI). In this traditional 
mode, an interviewer edited the questionnaire to ensure that 
the information was correct and complete. Information 
abbreviated to shorten the interview was filled-in in detail 
after the interview and before the form was sent for data 
capture. The first change towards computerisation was the 
use of Computer-assisted Telephone Interviewing (CATI). 
This computerised collection mode was used for surveys 
that were conducted by telephone from a central location. 
CATI was the first instance of amalgamation of the 
collection and capture of information in household surveys. 
Given the state of technology at that point, the computers 
capable of handling the complexity associated with 
computer-assisted interviewing were fairly large. Hence, 
CATI could replace PAPI only in centralised telephone 
surveys. In the 1990s, with the advent of more powerful 
portable computers, decentralised CAI replaced PAPI. A 
decentralised collection mode is, in effect, what is used in 
most household surveys. In addition, data collection often 
required the ability to do either telephone interviews or 
personal visits. However, much of the know-how and 
experience of computer-assisted telephone interviewing 
could be applied to decentralised computer-assisted 
interviewing. 

Since the 1980s, it was the Labour Force Survey (LFS) 
that served as the main research and testing vehicle for 
CATI technology. The first test, conducted in 1987, was a 
controlled study that compared CATI in a centralised 
environment to PAPI. It consisted of a research project 
carried out jointly between Statistics Canada and the US 
Bureau of the Census (see Catlin and Ingram 1988). The 
study showed that there were differences between the two 
collection methods in terms of data quality indicators, and 
those differences were in favour of CAI in terms of lower 
rejection rates on edit, reduction in path errors on the 
questionnaire and decrease in undercoverage in the LFS. 

While CATI was never implemented for the LFS, the 
experience was used to set up a CATI facility for use in 
random digit dialling (RDD) in household surveys. As 
technology progressed, CATI was used to collect more 
complicated RDD surveys like the General Social Survey 
(GSS) and the Violence against Women Survey. 
Computer-assisted telephone interviewing continues to be 
used as an integral part of household collection at Statistics 
Canada complemented by the computer-assisted inter- 
viewing infrastructure. 


2.2 Technological Testing 


A new wave of testing began in the early 1990s as part 
of the decennial redesign of the LFS (Singh, Gambino and 
Laniel 1993; Drew, Gambino, Akyeampong and Williams 
1991). The launching of three large scale longitudinal 
surveys by Statistics Canada made the investment for a CAI 
infrastructure possible by sharing the costs among a number 
of surveys. Consequently, in 1991, a second test was 
conducted using the LFS and SLID to study the feasibility 
of using new technologies (see Williams and Spaull 1992). 
Portable computers which require the use of a stylus rather 
than a keyboard for entering data were tested. The results 
showed that the technology was promising but that it 
needed further improvements for it to be used to handle the 
requirements of Statistics Canada’s household surveys. 

The following year, from July 1992 to January 1993, a 
third and a fourth test were conducted, this time using 
conventional portable computers. The results for the LFS 
are documented in Kaushal and Laniel (1995), while the 
results for SLID are reported in Michaud, Le Petit, and 
Lavigne (1993) and Michaud, Lavigne and Pottle (1993). 
For the LFS, the main objective of this third test was to 
determine if the transition to the new technology would 
disrupt the LFS data series. The secondary objective of the 
test was to determine whether the new technology affected 
data quality and interview costs. Additional objectives of 
this test were the operational development and evaluation 
of the CAI approach. For the longitudinal surveys, the main 
concern was the length and complexity of the question- 
naires and the addition of new functions, such as tracing. 
Consequently, the main criterion in assessing the 
application was the feasibility of developing various 
functions. The results showed that CAI had no major 
impact for the LFS on either the data series disseminated, 
the survey’s main quality indicators, or interview costs. On 
the strength of general comparisons with outside sources 
and an analysis of missing variables, the new technology 
was adopted. 


2.3 New Dimension of Nonresponse 


With the adoption of CAI, there was an unintentional 
development of a new dimension of nonresponse that is due 
to “technical problems”. Such nonresponse resulted from 
cases that were lost or not received before the end of the 
collection period. The PAPI version of this type of 
nonresponse was related to occasional postal problems. 
Conceptually, these situations do not refer to real 
nonrespondents; however, the information is not available 
in time to produce estimates. 

These technical problems assume three different forms: 
(i) transmission problems, (ii) equipment problems, and (iii) 
unavoidable problems. Transmission problems are the most 
common. They arise, for example, when telephone lines are 
down, when there is a problem with the automatic down- 
loading of data, when an attempt is made to download data 
while maintenance is being carried out on the mainframe 
computer, or simply because of a malfunction in the CAI 
system. The second type of problem, although less 
common, occurs when a hard drive crashes, the magnetic 
tape drive fails, there is insufficient memory or there are 
computer equipment problems at the regional offices. 
Finally, unavoidable problems, which are even less 
common, include specific problems implicitly created by 
the above two categories, for example when only one of the 
two components expected from a respondent is transmitted 
or if the initialisation parameters needed for the proper 
functioning of the programs are missing. 

Nonresponse due to technical problems diminished over 
the initial months. This component of nonresponse was 
analysed quite carefully to explain an upward trend in 
nonresponse and to assess the performance of the CAI 
approach (see Simard, Dufour and Mayda 1995; Dufour, 
Simard and Mayda 1995). At the start of the conversion of 
the household surveys to CAI, technical problems repre- 
sented on average 15% of total nonresponse and could 
alone explain up to 25% of nonresponse. It took almost a 
full year before any significant reduction was observed in 
this component of nonresponse. Today, in 1997, the nonres- 
ponse due to technical problems is practically non-existent. 

In the first year, the bulk of the problems were due to a 
conflict over memory management in the notebook 
computer between two pieces of software used in case 
management. This was resolved by a rewrite of part of
the software, which eliminated the conflict and made the
system more efficient. The more subtle issues of the 
transition were communication and experience. A 
communication strategy was developed to enable the 
different players (in particular technical personnel and 
interviewers) to better understand each other, disseminate 
information more quickly and adequately inform all persons 
concerned. When CAI was first introduced, it took 
technical support personnel more than a day to find a 
solution to some problems. Faster response procedures were 
established, and a 24-hour support service was set up at 
head office in Ottawa. With such a substantial change, a 
learning and adjustment period is required, and Statistics 
Canada was no exception. 


2.4 Impact of CAI on Nonresponse 


Are there grounds for believing that the use of CAI had 
an effect on nonresponse rates? The answer to such a 
question has to be yes in light of the technical problems 
encountered, primarily at the beginning of the conversion 
process. However, if this aspect of the nonresponse is 
discounted, there is no indication that CAI had any lasting 
effect on nonresponse rates. The LFS nonresponse 
fluctuated following the introduction of CAI, but these 
fluctuations may be explained by a number of other factors 
(the redesign of the sample, which is now more urbanised; 
hiring of new interviewers; etc.), since the LFS was
undergoing a major overhaul. It took just under two years 
for overall nonresponse to return to levels similar to those 
recorded in the paper and pencil era. 

In the LFS, the conversion took place over a period of 
five months during which time the CAI and PAPI 
nonresponse rates could be compared. These comparisons 
show that the nonresponse rates for CAI (excluding 
technical problems) and those for PAPI were in the same 
range and exhibited the same trends (see Simard and 
Dufour 1995). Moreover, all the main components of 
nonresponse, namely refusal to participate in the survey, 
household temporarily absent, no one at home and other 
reasons, exhibited similar annual patterns before and after 
the implementation. There were concerns that respondents 
would be more reluctant to answer due to the presence of a 
computer for personal interviews, resulting in an increase 
in refusals. However, no change in the refusal component 
was detected. 

In early 1995, the three longitudinal surveys (SLID, 
NLSCY and NPHS), as well as the LFS, were conducted 
during similar collection periods. The current case 
management environment, as well as the sharing of the 
infrastructure among surveys, created extra pressure on 
interviewers in the field. Moreover, the survey collection 
periods were limited because there was a limited number of 
applications that could reside on the computers at the same 
time. Analysis was done to determine if response problems 
arose from conducting several surveys simultaneously, or in 
quick succession, in the field using CAI. For the quarterly
collection of the NPHS, interviewers followed up
nonrespondents from previous collections. An analysis was
carried out to determine the possible conversion rate. The
results showed that in the case where there were fewer CAI 
surveys in the field at the same time, a first wave of 
follow-ups of nonrespondents increased the response rate, 
but continuing the process for a second or third time 
brought few gains (an increase of 5.76% from the first to 
the second quarter, 0.97% from the second to the third, and 
0.91% from the third to the fourth). However, a last 
follow-up was carried out in June 1995 when there were 
almost no surveys in the field. This procedure improved the 
overall response rate by approximately 5%, which was 
higher than expected. This led to the conclusion that CAI 
had to be able to give more flexibility in the length of the 
collection period and allow multiple applications to reside 
on the computer in order to maintain the response rates that 
would have been obtained in a paper and pencil
environment. 


3. NEW OPPORTUNITIES FOR 
HOUSEHOLD SURVEYS 


The adoption of CAI collection has added new 
opportunities to household surveys. These new opportu- 
nities, which were either non-existent or operationally 
difficult in a paper and pencil mode, help to reduce 
non-sampling errors, to collect more specialised 
information, to facilitate the reconstruction of family units 
and to make contact with family units that break apart or 
merge. In fact, this collection method is better suited to 
adjust the collection process according to the changing 
needs of today’s society. 


3.1 Dependent Interviews 


The introduction of the new technology served to resolve 
household survey problems that had proven intractable 
under the traditional paper and pencil interview approach. 
In particular, CAI increased the amount of information that
the interviewer could provide to a respondent contacted for
the second time. This serves to reduce (i) response error
(coding, capture or recall error), in particular the seam
problem and telescoping, and (ii) response burden, by
confirming the information instead of requesting it again
(or by requesting only partial information).

The seam problem has been documented for longitudinal 
surveys in Murray, Michaud, Egan and Lemaitre (1990), 
which notes that the problem arises in reconciling data from 
successive collection periods. If no reconciliation has been 
attempted between collections, an artificially large change 
in estimates is generally observed at each collection 
transition. This problem is generally explained by 
respondents’ difficulty in pinpointing the date when a 
change occurs. As to telescoping, it results from a tendency 
to include certain events that occurred outside the reference 
period. 

Under the traditional PAPI approach, the type of 
information that could be provided to interviewers was 
limited. Questionnaires could only be pre-printed with basic
information, as there were physical limits to the amount of
information that could be pre-printed, especially for long 
questionnaires. In some cases, additional information was 
even printed on a separate questionnaire. This procedure 
also involved additional logistical problems for the 
interviewer. The use of information from earlier occasions 
in the process is known as feedback. With computer- 
assisted interviewing, feedback is made possible in two 
ways: proactively and reactively. A discussion of this is also 
provided in Brown et al. (1997). 

Feedback is used proactively to reduce response
error by helping the respondent to situate him/herself. For
example, SLID gathers detailed information on a maximum 
of six jobs in the previous year. Without feedback, the name 
of the employer or the occupation might be written slightly 
differently, and a job that continued over a period of two 
years could be incorrectly classified as a change. Initially 
there was some concern that the respondent would perceive 
feedback negatively, but in fact, few negative comments 
have been received. 

The confirmation rate is generally high — over 90% for 
data that are presented to the respondent (see Hale and 
Michaud 1995). The study of Hiemstra, Lavigne and 
Webber (1993) concerning the labour market suggests that 
while feedback generally serves to reduce the seam effect, 
the problem is only partially solved. For example, SLID 
confirms employment, job search or joblessness at the 
beginning of the previous calendar year over a one-year 
recall period. Micro-comparisons with a cross-sectional 
monthly survey, conducted over the first five months of the 
year, suggest that feedback greatly reduces the seam effect. 
However, consistency with cross-sectional data decreases 
over the months, which seems to suggest that response 
error, although eased by feedback, is still a problem. 

The proactive use of feedback may, however, 
underestimate measures of change. For this reason, for 
sensitive information and for reasons of confidentiality, the 
technique is also used reactively. The reactive use of 
feedback can be used to detect unusual changes, or to 
confirm inconsistencies in the data. As an illustration, in the 
interview for the first wave of SLID, jobless spells are 
identified and for each spell the respondent is asked 
whether employment insurance benefits have been received. 
The second wave interview asks for detailed information on 
various sources of income and amounts received including 
employment insurance benefits. Comparisons with outside 
sources suggest that traditionally, the amounts of 
employment insurance reported in a survey represent 
approximately 80% of the contributions paid. In SLID, 
previous information was stored in memory. If an amount 
was not reported and there was an indicator flagging an 
inconsistency with the first-wave interview, an additional 
question was asked to determine whether the amount had 
been omitted. An analysis of the first wave of SLID 
suggests that reactive checking increased the number of 
reported cases by nearly 30%. However, 28% of the
persons who had neglected to report an amount confirmed
that they had received an amount but were unwilling to
report that amount. There was thus confirmation of the
source, but the amount had to be imputed and the problem 
was not totally solved. More details on this subject may be 
found in Dibbs, Hale, Loverock and Michaud (1995). 
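
The reactive check described above can be sketched as follows.
This is a minimal illustration in Python, not the SLID
application itself; the function name and the field names
(ei_flag_wave1, ei_amount_wave2) are hypothetical, and an
unreported amount is assumed to be stored as None.

```python
from typing import Optional


def needs_reactive_probe(ei_flag_wave1: bool,
                         ei_amount_wave2: Optional[float]) -> bool:
    """Return True when the additional question should be asked.

    The probe fires only if the first-wave interview indicated that
    benefits were received and no amount was reported in the second wave.
    """
    return ei_flag_wave1 and ei_amount_wave2 is None


# Hypothetical respondents: (wave-1 indicator, wave-2 reported amount)
cases = [(True, 4200.0), (True, None), (False, None)]
probes = [needs_reactive_probe(flag, amount) for flag, amount in cases]
```

Only the second respondent is probed: the stored first-wave
indicator conflicts with the missing amount. The stored
information is never shown to the respondent unless the
inconsistency arises, which is what distinguishes reactive from
proactive feedback.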


3.2 A More Efficient Tool 


With an efficient collection tool like CAI, it is now 
possible to collect, to limit, to access and to transfer detailed 
information which would traditionally have been very 
difficult, or even not possible, to do with PAPI. 

3.2.1 Matrix of Relationships Between the Various 
Members of a Household 


Household surveys create different levels for analysis 
such as the economic family and the census family, by 
using the relationships between the various persons in the 
household with a single person often called the “family 
head”. There are limitations to this method, for example in
identifying the children of blended families or 
reconstructing families to three generations. In a 
longitudinal context, the concept of family head is a 
definition that can vary over time and so a number of 
longitudinal surveys have used a matrix of relationships for 
all members of the household. CAI can limit collection to
the lower triangle of the matrix, since each relationship
above the diagonal is the inverse of one below it. Provided
that the composition of a household does not change between
two collections, it is not necessary to collect the
relationship matrix again. Interactive edits (based on age, for
example) serve to correct any relationships captured in 
reverse (e.g., a parent-child relationship). It took a number 
of attempts to develop an effective means of identifying 
relationships that would allow not only for the collection of 
the information but also for easy correction. With the 
improved version of the collection procedure, less than 1% 
of relationships required further correction after collection 
(as compared to 5.3% inconsistency before the interactive 
edits on the relationship matrix). Corrections in a CAI 
environment probably continue to be one of the areas in 
which research is still required. 
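
As an illustration of the relationship matrix, the sketch below
collects only the cells below the diagonal, derives the remaining
cells, and applies an age-based edit. It is written in Python
under stated assumptions: the relationship codes, the INVERSE
table and the function names are hypothetical and do not
reproduce the codes used in any Statistics Canada survey.

```python
# Hypothetical relationship codes and their inverses.
INVERSE = {"parent": "child", "child": "parent",
           "spouse": "spouse", "sibling": "sibling"}


def lower_triangle_pairs(n):
    """Pairs (i, j) with i > j: the only cells the interviewer must fill."""
    return [(i, j) for i in range(n) for j in range(i)]


def complete_matrix(n, lower):
    """Fill the upper triangle from the lower one using INVERSE."""
    matrix = [[None] * n for _ in range(n)]
    for (i, j), rel in lower.items():
        matrix[i][j] = rel            # relationship of person i to person j
        matrix[j][i] = INVERSE[rel]   # implied relationship of j to i
    return matrix


def age_edit(ages, lower):
    """Return the pairs whose parent-child direction contradicts the ages."""
    flagged = []
    for (i, j), rel in lower.items():
        if rel == "parent" and ages[i] <= ages[j]:
            flagged.append((i, j))
        elif rel == "child" and ages[i] >= ages[j]:
            flagged.append((i, j))
    return flagged


ages = [42, 40, 12]
# Person 2 (age 12) was keyed as the "parent" of person 0: a reversed entry.
lower = {(1, 0): "spouse", (2, 0): "parent", (2, 1): "child"}
flagged = age_edit(ages, lower)
```

For a three-person household, only three cells are collected
instead of six, and the edit flags the pair (2, 0) so that the
interviewer can correct it during the interview.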


3.2.2 Access to More Sophisticated Collection 
Instruments 


CAI has also provided access to more sophisticated 
collection instruments. For example, the NLSCY obtains a 
variety of information on a cohort of children aged 0-11 
years. One part of the interview is designed to measure the 
child’s vocabulary level. The survey uses the Peabody 
Picture Vocabulary Test (PPVT) as one of its collection 
instruments. However, the PPVT is normally used in a more
specialised environment, and persons administering it
generally need several days of in-depth training. The
test involves a series of images, and the child is asked to
choose the image that corresponds to a given word. The
starting level depends on the child’s age. Questions are
administered until the child gets a certain number of wrong
answers. At this point, the interviewer must return to the
starting level and re-administer the previous questions, until
the child gives a pre-determined number of wrong answers.
The administration of the test calls for determining a 
threshold based on criteria, counting the number of wrong 
answers, skipping between questions depending on the 
number of wrong answers, and stopping the test. These 
procedures would have required a considerable amount of 
training if it had been necessary to administer the test on 
paper. CAI has greatly facilitated the process by allowing 
programming of the edit rules in advance. The data from the 
first collection suggest that the computer-assisted condi- 
tions of administration yield good-quality results when 
compared to external norms. 
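
The kind of logic that CAI takes over from the interviewer can
be sketched as below. This Python fragment is a simplified
illustration only: it covers an age-dependent starting item and
a forward stopping rule based on a count of wrong answers. The
actual PPVT administration rules, including the return to the
starting level, are not reproduced, and the table start_by_age,
the threshold max_wrong and the function names are hypothetical.

```python
def starting_item(age_years, start_by_age):
    """Look up the starting item for the child's age (hypothetical table)."""
    return start_by_age[age_years]


def administer_forward(correct, start, max_wrong):
    """Ask items from `start` onward and stop after `max_wrong` errors.

    `correct[i]` is True when the child chooses the right image for
    item i. Returns the indices of the items actually administered.
    """
    asked = []
    wrong = 0
    for index in range(start, len(correct)):
        asked.append(index)
        if not correct[index]:
            wrong += 1
            if wrong == max_wrong:
                break  # stopping rule reached: end this pass of the test
    return asked


start_by_age = {4: 0, 5: 2, 6: 4}   # illustrative values only
responses = [True, True, True, False, True, False, False, True]
asked = administer_forward(responses, starting_item(5, start_by_age),
                           max_wrong=3)
```

With these illustrative values, a five-year-old starts at item 2
and the pass stops at item 6, where the third wrong answer
occurs; the interviewer does not have to count errors or decide
when to stop.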


3.2.3 Establishing Longitudinal Links 


In the case of longitudinal links, all the members of an
initial household may be part of the longitudinal sample,
as in SLID for example. In subsequent
collections, the longitudinal persons are interviewed along 
with all persons with whom they live. In the case of a 
household that splits, a new household must be created for 
the persons who left the original household. With the 
adoption of CAI, it became possible to create new unique 
household identifiers linked to the original identifiers; this
made it easier to reconcile the dynamics of change in 
household composition. A particular problem that has been 
greatly lessened is the treatment of the real duplicates that 
occur as a result of changes in household composition. For 
example, an adolescent might belong to a given household 
at the time of the first collection, then leave his parents’
household by the time of the second collection but return to 
the original household by the time of the third collection. In 
the second collection, the person is identified as belonging 
to a new household, and a new identifier is thus associated
with him. In the third collection, when the parents’ 
household is again contacted, the adolescent who has 
returned may be indicated as a new person in the household. 
If the interviewer is shown the list of persons who have 
formerly been part of the household, the need to reconcile 
duplicates is greatly reduced. A similar treatment has been 
carried out for jobs where a list of previous employers is 
used for longitudinal reconciliation of jobs. 
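
The adolescent example above can be sketched in Python as
follows. This is a toy model under stated assumptions, not the
survey's actual data structure: the class HouseholdRegistry, its
methods and the sequential identifiers are all hypothetical.

```python
import itertools


class HouseholdRegistry:
    """Toy registry; identifiers and structure are hypothetical."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.origin = {}          # household id -> household it split from
        self.members = {}         # household id -> current person ids
        self.former_members = {}  # household id -> persons who have left

    def create(self, persons, origin=None):
        hh = next(self._next_id)
        self.origin[hh] = origin
        self.members[hh] = set(persons)
        self.former_members[hh] = set()
        return hh

    def split(self, hh, movers):
        """Move `movers` out of `hh` into a new household linked to `hh`."""
        self.members[hh] -= set(movers)
        self.former_members[hh] |= set(movers)
        return self.create(movers, origin=hh)

    def rejoin(self, hh, person, from_hh):
        """A former member returns: reuse the person id, add no duplicate."""
        if person not in self.former_members[hh]:
            raise ValueError("not a former member; treat as a new person")
        self.members[from_hh].discard(person)
        self.former_members[hh].discard(person)
        self.members[hh].add(person)


registry = HouseholdRegistry()
parents_hh = registry.create({"mother", "father", "teen"})
teen_hh = registry.split(parents_hh, {"teen"})         # second collection
registry.rejoin(parents_hh, "teen", from_hh=teen_hh)   # third collection
```

Because the list of former members is kept with the household,
the returning adolescent is matched to the existing person
record rather than entered as a new person.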


3.2.4 Tracing of Individuals 


With the conversion to CAI, certain procedures such as 
tracing were automated. Brown et al. (1997) give specific
examples. As noted above with respect to establishing 
longitudinal links, traced individuals may all be put into a 
new household with a unique identifier. Fewer paper 
manipulations are required, and it is now possible to obtain 
more management information. CAI has made it possible to 
set up a two-level tracing procedure. The interviewer first 
attempts the tracing. If this is not successful, all information 
on the case is transferred to a tracing unit in the regional 
office where more sources for tracing are available. 
Automation has eliminated many manipulations and 
transcriptions of records on paper. Formerly when a 
household split, a new identification sheet was usually 
created on paper with a link to the previous household. The
names of the persons who had moved were entered on it. If
the person to be traced was not found, all the forms for all 
the persons who had been living together the previous year 
were transferred. These manipulations greatly increased the 
risk of error. Transfers of cases between tracing levels are 
also done more quickly. In addition, each call is recorded 
automatically along with its result. While there was a 
similar procedure with the paper and pencil approach, the 
information was seldom entered. It was also hard to analyse 
the information for determining the most useful tracing 
sources. 
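
The two-level tracing procedure and the automatic recording of
calls can be sketched as below. The Python class TracingCase,
its fields and the source labels are hypothetical names chosen
for illustration; the sketch assumes that escalation is a
separate step taken once the interviewer's attempts have failed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TracingCase:
    case_id: str
    level: str = "interviewer"      # "interviewer" or "regional_office"
    call_log: list = field(default_factory=list)
    found: bool = False

    def record_attempt(self, source, success):
        """Log every attempt automatically, together with its result."""
        self.call_log.append((self.level, source, success))
        self.found = self.found or success

    def escalate(self):
        """Transfer an unresolved case to the regional tracing unit."""
        if not self.found and self.level == "interviewer":
            self.level = "regional_office"


def successful_sources(cases):
    """Count which tracing sources led to a successful contact."""
    return Counter(source
                   for case in cases
                   for _, source, success in case.call_log
                   if success)


case = TracingCase("A-001")
case.record_attempt("last known phone number", False)
case.escalate()
case.record_attempt("directory search", True)
tally = successful_sources([case])
```

Since each attempt is logged with its level, source and result,
the question of which tracing sources are most useful becomes a
simple tabulation of the log.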

Tracing is a key factor in maintaining data quality. With 
current tracing procedures, cases requiring tracing can be 
kept in the field a little longer, but the collection window 
remains limited. It is possible that more effective 
procedures can be established if the efforts of the various 
longitudinal surveys are integrated. Increased functionality, 
combined with central tracing, is currently being examined. 
This would make it possible to combine the tracing efforts 
of the various surveys, and it might also make it possible to 
have batch entries to try to link cases requiring tracing to 
databases. 


3.3 New Quality Indicators


The CAI approach adopted by Statistics Canada for its 
household surveys features a complex system capable of 
monitoring survey activities during the collection period to 
ensure their smooth operation. This system, called the “case
management system” (CMS), is a sophisticated system that 
manages all survey activities from the beginning to the end 
of the survey cycle. This system is flexible, since it can be 
adapted to the requirements of the different household 
surveys that use it. The CMS performs three main 
functions: (i) routing of cases, (ii) reporting of activities and 
(iii) assisting interviewers. The routing component directs
the movements of cases during the survey, whether from an 
interviewer to the regional office, from the regional office 
to head office, etc. The second component of the CMS 
produces different reports for describing the status of the 
survey at a given point in time, evaluating the performance 
and progress of the survey, and describing the status of 
interviews. A whole range of information is generated by 
this second component of the CMS. Lastly, the third 
module enables interviewers to perform their tasks more 
effectively, by giving options for making appointments, 
recording notes and so on. 

As a result, this system provides a mass of information 
on what is actually happening in the field during a survey; 
every action taken on a case is recorded by the CMS. The 
main challenge with such a system is to avoid getting lost in 
the great mass of information available. Work teams have 
been set up to master these information sources, develop 
new quality indicators using this information or combining 
it with information already available, find uses (e.g., 
additional training, improvement of the collection 
instrument), and develop ways to present these indicators 
effectively. 


A large number of quality indicators have been produced 
(see Simard et al. 1995; Allard, Brisebois, Dufour and 
Simard 1996) on a regular basis at different levels of 
interest (geographic, interviewers, administrative). These 
indicators may be grouped into two categories: 
informational and for monitoring purposes. Examples of 
informational indicators are: number of attempts before 
completing a case, distribution of interviews completed per 
day of collection, best day-hour combination for reaching 
a respondent, median duration of interviews, and number of 
edit rules triggered and ignored or triggered and acted upon 
(see Brisebois, Dufour and Lévesque 1997). Informational
indicators are used to improve or make changes to the
collection strategy or process. 

In terms of monitoring, a series of indicators are used to 
trace irregularities, technical or human, in the field. Among 
these are: calls and visits done after the date of transmission 
but before the survey week, calls and visits done after 
Sunday of survey week, working period too early, working 
period too late, interviews too short, etc. This information 
serves to show whether instructions issued by head office 
are followed, and whether some interviewers require 
additional training. However, all data need to be analysed 
with caution to determine the cause of the irregularity. For 
example, an interview conducted at 4:30 am may well be at 
the request of a respondent, like a farmer, or due to an 
incorrect time on the computer clock (see Brisebois et al.
1997).
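
Monitoring indicators of this kind can be derived from the
time stamps recorded for each case. The Python sketch below is
illustrative only: the thresholds (earliest, latest,
min_minutes) and the function name are assumptions, not the
rules applied by Statistics Canada.

```python
from datetime import datetime, time


def monitoring_flags(start, duration_minutes,
                     earliest=time(8, 0), latest=time(21, 0),
                     min_minutes=5):
    """Return the monitoring flags raised by one interview record."""
    flags = []
    if start.time() < earliest:
        flags.append("working period too early")
    if start.time() > latest:
        flags.append("working period too late")
    if duration_minutes < min_minutes:
        flags.append("interview too short")
    return flags


# An interview started at 4:30 am and lasting 12 minutes.
flags = monitoring_flags(datetime(1997, 3, 18, 4, 30), duration_minutes=12)
```

A flag is a prompt for review rather than evidence of an error:
the 4:30 am interview is flagged here, but it may have been
requested by the respondent or result from an incorrect
computer clock.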

CAI also offers interviewers the opportunity to include 
a comment for each question or to explain the reason for the 
code used. It is therefore possible to develop adequate 
training, to better understand the surveys and accordingly to 
adapt them to realities in the field. For example, this feature 
made it possible to conduct a special study on the reasons 
for refusal to participate in one of Statistics Canada’s 
household surveys; to conduct such a study would have 
formerly required a great deal of effort (see Allard, Dufour, 
Simard and Bastien 1996). 


4. ONGOING CHALLENGES OF CAI 


This section describes long-term challenges in 
developing, implementing and understanding the use of 
CAI for survey applications. The powerful tools provided 
by CAI have led us to degrees of complexity in content, 
software and electronic communications that may not be 
widely appreciated. The conversion to CAI has implied a 
new dependence on informatics. This dependence is one of 
the major challenges that Statistics Canada has to face with 
CAI, since the technology is changing so quickly. 


4.1 Workload of Interviewers 


A common infrastructure requires the sharing of limited 
resources, such as trained interviewers equipped with 
portable computers, by different surveys. As a consequence, 
any increase in either the number of surveys or the amount 
of information collected must be coordinated with the
other surveys. It should be noted that the same interviewers
tend to be used by a large number of surveys, which can 
result in fairly large workloads, exacerbated by a short 
collection period. While response rates have recovered 
since the introduction of CAI, a heavy workload for 
interviewers can lead to deterioration in data quality, owing 
to fewer follow-ups and higher nonresponse. 

Given the nature of the CMS, an administrative structure 
for communication, based on the needs of a given survey 
(based on the response codes), must be put in place to 
provide for the routing of cases between the interviewers, 
their supervisors and the regional offices. Since CAI was 
first introduced, there have been great improvements in the 
communications process to ensure that all interviewers 
correctly receive their assignments, the latest version of the 
application or various changes; nevertheless, this process 
must be constantly monitored. For example, after the end of 
the collection period, cases must be transmitted and deleted 
from the interviewers’ computers. Often, the cases that 
were not transmitted consist mainly of nonresponse cases. 
The fact that these cases are not transmitted to head office 
after the end of collection means that the reasons for 
nonresponse are sometimes lost. While many of these
problems can be detected during testing, a few
exceptional cases still remain.


4.2 Control Procedures for CAI 


The CMS and survey applications have the potential to 
generate many databases. The quantity of data is often 
overwhelming, and the data are not currently being used to 
their maximum potential. In addition, the speed inherent in 
CAI sometimes does not allow for sufficient time and 
resources to analyse and control this mass of information. 
For the moment, this information is used after the fact, but 
it would be highly desirable to be able to use it while the 
survey is in the field. 

This information should be made available to inter- 
viewers in an integrated format. However, a balance is 
needed to avoid excessive surveillance where interviewers 
focus more on the quality indicators than on the quality of 
the data. Ideally, analysis across several surveys could 
identify specific problems, which could then be dealt with 
in training kits that are brief and focused. In addition, 
response rates and coverage rates could be integrated for 
surveys. All this information could be used to achieve more 
efficient time management or to develop training in specific 
interview skills. 


4.3 Editing During Collection 


While CAI offers the possibility of including a great 
number of edit rules at the time of the interview, it is 
important here as well to maintain a balance between the 
rules programmed into the collection instrument and the 
rules applied during batch processing at head office. The 
rules programmed into the instrument prolong the 
interview, which results in an increase in both costs and 
response burden. Over time, and with rapid changes in 
technology, it should be possible to apply a larger number
of edits during the interview without interfering with its
flow. On the other hand, clarifications at the time of the 
interview undeniably result in better quality data. The 
NPHS obtains better quality data in the second quarter by 
using information from the first quarter to feed the edit 
system. For example, clarifying with the respondent at the
interview led to the discovery that, for the arthritis variable,
of the 7.0% of individuals who indicated a change in 
condition between the two quarters, 3.3% actually 
experienced a change while 3.5% represented errors. For 
further details, see Catlin, Roberts and Ingram (1996). 

With CAI, it is also possible to store information to 
identify which edit rules have been triggered and what 
corrections were made. A study of the most frequently 
triggered edit rules would determine which rules most 
affect data quality, with the results of these studies serving 
not only as information but also as inputs for changing
overly strict edit rules and also for sustaining a dynamic 
correction system. Another aspect that is just as important 
is the ease with which the interviewer can make the 
necessary corrections. If the corrections can be made to the 
actual response or the preceding response to a question, the 
interviewer can easily identify the changes to be made. If 
the correction involves editing between several answers, 
then the need to determine which one requires correction, 
and to move between the various answers in which there 
may be an error, sometimes makes the process too complex 
for the edit to be carried out during the interview. 
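
A study of the most frequently triggered edit rules could start
from a simple tabulation of the stored information. The Python
sketch below assumes a hypothetical audit trail of (edit rule,
interviewer action) pairs; the rule identifiers and action
labels are invented for illustration.

```python
from collections import Counter

# Hypothetical audit-trail records: (edit rule id, interviewer action)
audit_trail = [
    ("E01", "corrected"), ("E01", "ignored"), ("E07", "ignored"),
    ("E01", "corrected"), ("E07", "ignored"), ("E12", "corrected"),
]

# How often each rule fired, and how often it was overridden.
triggered = Counter(rule for rule, _ in audit_trail)
ignored = Counter(rule for rule, action in audit_trail
                  if action == "ignored")

# Share of triggers that the interviewer ignored, per rule.
ignore_rate = {rule: ignored[rule] / count
               for rule, count in triggered.items()}
```

In this invented example, rule E07 is ignored every time it
fires, which would make it a candidate for review as an overly
strict rule, while E01 fires most often and leads to corrections
in most cases.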

Apart from technical problems, there are methodological 
problems associated with the effect of edit rules on data 
quality. At what stage are the different edit rules the most 
effective? The rules that affect the flow of the questionnaire 
and those that determine which persons are outside the 
scope of the survey are critical edit rules. The key variables
used for poststratification and key estimates are best 
resolved at the time of the interview. The quantity of edit 
rules that can be incorporated into the CAI system must be 
balanced with the speed of the portable computer. In 
addition, when some edit rules are being developed for the 
instrument and others for central processing, care must be 
taken to ensure that the two types of rules are not 
contradictory. 


4.5 Data Confidentiality 


Maintaining data confidentiality, as stipulated by the 
Statistics Act, is one of the fundamental requirements of the 
use of CAI and the systems that support it. To meet such a 
requirement, a number of procedures have been developed 
including a computing environment with two commu- 
nication networks, one external and the other internal. The 
data are transferred physically, by tape, from the external 
network to the confidential internal network since there is 
no link between these two networks. It is impossible to 
access the internal network using a public modem. 
Confidentiality is also ensured by encryption of data 
whenever they must be transmitted over telephone lines. In 
addition, an access control system is incorporated into all 
portable computers, enabling only the interviewer to access
the information. The data are also encrypted while residing
on the notebook. 

The challenges relating to confidentiality in a CAI 
environment are quite different from those encountered 
with PAPI. Dependent interviews offer such a challenge 
for SLID. Information available from the preceding wave 
family unit may become sensitive in the case of, say, a 
family break-up. Thus, while the new technology offers the 
benefits of dependent interviews, these are accompanied by 
drawbacks that must be analysed for the specific situation. 

With the arrival of audio-CASI (known by the acronym 
CASI-A), sensitive subjects may be handled more easily. 
With this interview technique, respondents are linked to the 
computer with earphones, and the questions are read by a 
digitised voice. Since the question is heard via the headset, 
the respondent can choose whether or not to display the 
question on the screen. With these features, the respondent 
can complete the questionnaire in total anonymity. The 
NLSCY is planning to begin using this collection 
instrument by the year 2000. 


4.6 Re-Interview Programs 


CAI offers some enhancements over PAPI-based 
re-interview programs. Firstly, the rapid electronic 
transmission of data reduces discrepancies due to recall and 
memory problems, since the re-interview can be conducted
sooner after the initial interview. Strict adherence to
reconciliation procedures built into the software provides 
more accurate estimates of measurement error. This would 
eradicate the problem of interviewers peeking at the 
questionnaire before starting the re-interview. As well, 
reconciliation can be done after a subset of questions, a 
section or at the end of the questionnaire and as many times 
as desired. Re-interview cases are easily automated and 
integrated into a quality control process based on 
characteristics of the interviewer or the interview (e.g., 
specific cases related to training issues, cases belonging to 
a specific group, etc.). The quality of the data is better
since a great number of edit rules, identical to the ones used 
during the interview, are programmed for the re-interview. 
The features available from the CMS are also an asset for 
the re-interview program: progress of the re-interview 
program, performance and progress of the re-interview, 
easy transfer of cases, etc.


4.7 Interviewer Training 


With the adoption of CAI, interviewers had to cope with 
a major change in their work method. Training was 
therefore an essential stage in enabling them to adapt 
effectively to the computerised collection method. They 
became familiar with new work tools, including the 
keyboard, the portable computer and all the computer 
procedures, such as saving data, charging batteries and 
transmitting by modem. They also had to adapt their 
interview style to the requirements of CAI. New 
interviewers, for their part, had to familiarise themselves 
with survey concepts, interview techniques and the 
collection instrument. To meet this challenge, Statistics 
Canada developed a training strategy based on the 
experience acquired during the previous testing, as well as 
on the experience of British and American colleagues. 

Interviewer training will always be one of the key factors 
in the success of Statistics Canada surveys, and the agency 
is continually innovating in this field. For example, one of 
the initiatives for the LFS is a training strategy to enable 
senior interviewers to regularly receive a small CAI 
assignment (approximately 15 cases), just so they can 
practice collection by this method and thereby stay abreast 
of changes in the CAI application. In addition to the regular 
practice cases that are always available on the computer, the 
CAI system will provide interviewers with modules 
integrated into the collection system, dealing with such 
complex subjects as coverage and multiple dwellings, to 
enable them to always be updated or to review various 
difficult concepts. 


5. FUTURE OF CAI AT STATISTICS CANADA 


In the new environment of limited resources and high 
response burden, collection is becoming increasingly 
customised. While business surveys have been doing it for 
some time, mixed collection is beginning to be in demand 
for household surveys. Centralised collection outside the 
collection window for a limited number of respondents can 
be used to improve response rates (to focus on tracing for 
example). The environment necessary for this type of 
collection more closely resembles a CATI environment in 
which shared database functions for a small sample are 
available, with call planning functions. 

A complete redesign of the CAI application and the case 
management system is expected to be completed by the turn 
of the century. In this redesign, work teams must take 
account not only of computer capacity but also of the 
human factor. The latter factor is important since data 
collection and data quality depend on it. Interviewers must 
read the screen and enter the responses, tasks that call for 
perceptual and motor skills different from those required for 
pencil and paper interviews. The wording of questions is 
also harder to read on the screen, and interviewers mention 
that it is now harder to visualise the overall structure of a 
questionnaire. Hence special attention must be paid to 
screen design, the choice of colours, the amount of text 
displayed, the key functions pre-programmed and the ease 
of moving between screens. Since interviewers are also 
asked to work on several surveys, an effort should be made 
to standardise screen formats as much as possible. 

As regards the hardware and software components, work 
teams are currently concentrating on choosing the best 
combination. At present, different software packages are used for 
different components of some surveys. In order to 
standardise the applications available as much as possible, 
there are plans to use a uniform platform for all surveys in 
a Windows environment. The Windows environment 
should give both interviewers and programmers greater 
flexibility. The security systems must also be redesigned to 
conform to the technology adopted and to satisfy the 
requirements of Statistics Canada. Harmonisation of 
questions among surveys should be attempted, which would 
allow CAI programming to become more modularised. 
Respondent burden would also be reduced. 

The new system will have to be able to take account of 
both past and present requirements. For example, system 
features are re-examined in the light of the progress reports 
provided to operational staff in order to determine which 
areas need improvement. As noted in Section 4, a number 
of other possibilities are being considered, such as 
interactive training of interviewers, special training 
modules, the possibility of conducting re-interviews and 
better tracing tools. These procedures should make it 
possible to make better use of the flexibility resulting from 
the automation of the process. 

The case management system is also being redeveloped. 
One major consideration here is to obtain a robust 
communications system, in which changes can be sent out 
uniformly with a replication capability. While we still hope 
to develop a computer system that will be used for many 
years, the current reality seems to suggest that CAI is likely 
to continue to evolve rapidly. One challenge, then, since the 
technology is changing quickly (one need only think of the 
Internet), is to develop a new system that is flexible, so as 
to allow for adaptations without requiring a complete 
overhaul. 


Regression Analysis of Data Files that are 
Computer Matched - Part II 


FRITZ SCHEUREN and WILLIAM E. WINKLER 


ABSTRACT 


Many policy decisions are best made when there is supporting statistical evidence based on analyses of appropriate 
microdata. Sometimes all the needed data exist but reside in multiple files for which common identifiers (e.g., SINs, EINs, 
or SSN’s) are unavailable. This paper demonstrates a methodology for analyzing two such files: (1) when there is common 
nonunique information subject to significant error and (2) when each source file contains noncommon quantitative data that 
can be connected with appropriate models. Such a situation might arise with files of businesses only having difficult-to-use 
name and address information in common, one file with the energy products consumed by the companies, and the other file 
containing the types and amounts of goods they produce. Another situation might arise with files on individuals in which 
one file has earnings data, another information about health-related expenses, and a third information about receipts of 
supplemental payments. The goal of the methodology presented is to produce valid statistical analyses; appropriate 
microdata files may or may not be produced. 


KEY WORDS: Edit; Imputation; Record linkage; Regression analysis. 


1. INTRODUCTION 


1.1 Application Setting 


To model the energy economy properly, an economist 
might need company-specific microdata on the fuel and 
feedstocks used by companies that are only available from 
Agency A and corresponding microdata on the goods 
produced for companies that is only available from Agency 
B. To model the health of individuals in society, a 
demographer or health science policy worker might need 
individual-specific information on those receiving social 
benefits from Agencies B1, B2, and B3, corresponding 
income information from Agency I, and information on 
health services from Agencies H1 and H2. Such modeling 
is possible if analysts have access to the microdata and if 
unique, common identifiers are available (e.g., Oh and 
Scheuren 1975; Jabine and Scheuren 1986). If the only 
common identifiers are error-prone or nonunique or both, 
then probabilistic matching techniques (e.g., Newcombe, 
Kennedy, Axford and James 1959, Fellegi and Sunter 1969) 
are needed. 


1.2 Relation to Earlier Work 


In earlier work (Scheuren and Winkler 1993), we 
provided theory showing that elementary regression 
analyses could be accurately adjusted for matching error, 
employing knowledge of the quality of the matching. In 
that work we relied heavily on an error-rate estimation 
procedure of Belin and Rubin (1995). In later research (e.g., 
Winkler and Scheuren 1995, 1996), we showed that we 
could make further improvements by using noncommon 
quantitative data from the two files to improve matching 
and adjust statistical analyses for matching error. The main 
requirement — even in heretofore seemingly impossible 
situations — was that there exist a reasonable model for the 
relationships among the noncommon quantitative data. In 
the empirical example of this paper, we use data for which 
a very small subset of pairs can be accurately matched using 
name and address information only and for which the 
noncommon quantitative data is at least moderately 
correlated. In other situations, researchers might have a 
small microdata set that accurately represents relationships 
of noncommon data across a set of large administrative files 
or they might just have a reasonable guess at what the 
relationships among the noncommon data are. We are not 
sure, but conjecture that, with a reasonable starting point, 
the methods discussed here will succeed often enough to be 
of general value. 


1.3 Basic Approach 


The intuitive underpinnings of our methods are based on 
now well-known probabilistic record linkage (RL) and 
edit/imputation (EI) technologies. The ideas of modern RL 
were introduced by Newcombe (Newcombe et al. 1959) 
and mathematically formalized by Fellegi and Sunter 
(1969). Recent methods are described in Winkler (1994, 
1995). EI has traditionally been used to clean up erroneous 
data in files. The most pertinent methods are based on the 
EI model of Fellegi and Holt (1976). 

To adjust a statistical analysis for matching error, we 
employ a four-step recursive approach that is very powerful. 
We begin with an enhanced RL approach (e.g., Winkler 
1994, Belin and Rubin 1995) to delineate a subset of pairs 
of records in which the matching error rate is estimated to 
be very low. We perform a regression analysis, RA, on the 
low-error-rate linked records and partially adjust the 
regression model on the remainder of the pairs by applying 
previous methods (Scheuren and Winkler 1993). Then, we 
refine the EI model using traditional outlier-detection 
methods to edit and impute outliers in the remainder of the 
linked pairs. Another regression analysis (RA) is done and 
this time the results are fed back into the linkage step so that 
the RL step can be improved (and so on). The cycle 
continues until the analytic results desired cease to change. 
Schematically, these analytic linking methods take the form 


        ↗ RA ↘ 
     RL ← RA ← EI 
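
The cycle described above can be sketched as a small driver loop. This is only an illustrative skeleton, not the authors' software: the three callables (`link`, `regress`, `edit_impute`), the tolerance `tol` and the cap `max_cycles` are hypothetical names introduced here, and the sketch omits the separate low-error-rate subset and the 1993 adjustment used in the paper's first regression step.

```python
from typing import Callable, Optional, Sequence, Tuple

Pair = Tuple[float, float]  # (x, y) values taken from one linked pair of records


def analytic_linking(
    link: Callable[[Optional[float]], Sequence[Pair]],
    regress: Callable[[Sequence[Pair]], float],
    edit_impute: Callable[[Sequence[Pair], float], Sequence[Pair]],
    tol: float = 0.01,
    max_cycles: int = 10,
) -> Optional[float]:
    """Repeat RL -> RA -> EI -> RA until the estimated slope stops changing.

    All three callables are hypothetical placeholders supplied by the caller.
    """
    slope: Optional[float] = None
    for _ in range(max_cycles):
        pairs = link(slope)                       # RL: may use the previous fit
        provisional = regress(pairs)              # RA on the linked pairs
        edited = edit_impute(pairs, provisional)  # EI: edit and impute outliers
        new_slope = regress(edited)               # RA on the edited pairs
        if slope is not None and abs(new_slope - slope) < tol:
            return new_slope                      # results have ceased to change
        slope = new_slope
    return slope
```

The stopping rule here compares successive slope estimates; any other analytic result of interest could be monitored in the same way.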


1.4 Structure of What Follows 


Beginning with this introduction, the paper is divided 
into five sections. In the second section, we undertake a 
short review of Edit/Imputation (EI) and Record Linkage 
(RL) methods. Our purpose is not to describe them in detail 
but simply to set the stage for the present application. 
Because Regression Analysis (RA) is so well known, our 
treatment of it is covered only in the particular simulated 
application (Section 3). The intent of these simulations is to 
use matching scenarios that are more difficult than what 
most linkers typically encounter. Simultaneously, we 
employ quantitative data that is both easy to understand and 
hard to use in matching. In the fourth section, we present 
results. The final section consists of some conclusions and 
areas for future study. 


2. EI AND RL METHODS REVIEWED 


2.1 Edit/Imputation 


Methods of editing microdata have traditionally dealt 
with logical inconsistencies in data bases. Software 
consisted of if-then-else rules that were data-base-specific 
and very difficult to maintain or modify, so as to keep 
current. Imputation methods were part of the set of 
if-then-else rules and could yield revised records that still 
failed edits. In a major theoretical advance that broke with 
prior statistical methods, Fellegi and Holt (1976) introduced 
operations-research-based methods that both provided a 
means of checking the logical consistency of an edit system 
and assured that an edit-failing record could always be 
updated with imputed values, so that the revised record 
satisfies all edits. An additional advantage of Fellegi and 
Holt (1976) systems is that their edit methods tie directly 
with current methods of imputing microdata (e.g., Little and 
Rubin 1987). 

Although we will only consider continuous data in this 
paper, EI techniques also hold for discrete data and 
combinations of discrete and continuous data. In any event, 
suppose we have continuous data. In this case a collection 
of edits might consist of rules for each record of the form 


$$c_1 X < Y < c_2 X.$$

In words, 


Y can be expected to be greater than $c_1 X$ and less 
than $c_2 X$; hence, if Y is less than $c_1 X$ or greater 
than $c_2 X$, then the data record should be reviewed 
(with resource and other practical considerations 
determining the actual bounds used). 


Here Y may be total wages, X the number of employees, 
and $c_1$ and $c_2$ constants such that $c_1 < c_2$. When an (X, Y) 
pair associated with a record fails an edit, we may replace, 
say, Y with an estimate (or prediction). 
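
A ratio edit of this kind, followed by imputation of a predicted value, might look like the minimal sketch below. The function names, the bounds and the prediction rule are hypothetical illustrations and are not taken from the paper.

```python
from typing import Callable


def passes_ratio_edit(x: float, y: float, c1: float, c2: float) -> bool:
    """Return True when the record satisfies the edit c1*x < y < c2*x."""
    return c1 * x < y < c2 * x


def edit_and_impute(
    x: float,
    y: float,
    c1: float,
    c2: float,
    predict: Callable[[float], float],
) -> float:
    """Keep y when it passes the edit; otherwise replace it with predict(x)."""
    if passes_ratio_edit(x, y, c1, c2):
        return y
    return predict(x)


# Hypothetical bounds: wages per employee between 10,000 and 100,000.
wages = edit_and_impute(10, 5_000_000, 10_000, 100_000, lambda n: 40_000 * n)
```

In a Fellegi-Holt style system the imputed value should itself satisfy every edit, so the prediction rule would be chosen with the bounds in mind, as it is in this toy example.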


2.2 Record Linkage 


A record linkage process attempts to classify pairs in a 
product space A x B from two files A and B into M, the set 
of true links, and U, the set of true nonlinks. Making 
rigorous concepts introduced by Newcombe (e.g., 
Newcombe et al. 1959; Newcombe, Fair and Lalonde 
1992), Fellegi and Sunter (1969) considered ratios R of 
probabilities of the form 


$$R = \Pr(\gamma \in \Gamma \mid M) \,/\, \Pr(\gamma \in \Gamma \mid U)$$


where $\gamma$ is an arbitrary agreement pattern in a comparison 
space $\Gamma$. For instance, $\Gamma$ might consist of eight patterns 
representing simple agreement or not on surname, first 
name, and age. Alternatively, each $\gamma \in \Gamma$ might additionally 
account for the relative frequency with which specific 
surnames, such as Scheuren or Winkler, occur. The fields 
compared (surname, first name, age) are called matching 
variables. The decision rule is given by 


If R > Upper, then designate pair as a link. 


If Lower < R < Upper, then designate pair as a possible 
link and hold for clerical review. 


If R < Lower, then designate pair as a nonlink. 


Fellegi and Sunter (1969) showed that this decision rule 
is optimal in the sense that for any pair of fixed bounds on 
R, the middle region is minimized over all decision rules on 
the same comparison space $\Gamma$. The cutoff thresholds, Upper 
and Lower, are determined by the error bounds. We call the 
ratio R or any monotonically increasing transformation of it 
(typically a logarithm) a matching weight or total 
agreement weight. 
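
The decision rule above can be illustrated with a short sketch. It assumes, for simplicity, that agreement on each matching variable is conditionally independent given match status, so that the ratio R factors into per-field terms; this assumption, the field probabilities `m` and `u`, and the cutoffs in the usage lines are illustrative values introduced here, not figures from the paper.

```python
import math
from typing import Sequence


def match_weight(
    agree: Sequence[bool], m: Sequence[float], u: Sequence[float]
) -> float:
    """Return log2(R) for one agreement pattern.

    m[i] = Pr(field i agrees | M), u[i] = Pr(field i agrees | U).
    Conditional independence across fields is assumed.
    """
    weight = 0.0
    for a, m_i, u_i in zip(agree, m, u):
        if a:
            weight += math.log2(m_i / u_i)
        else:
            weight += math.log2((1.0 - m_i) / (1.0 - u_i))
    return weight


def classify(weight: float, lower: float, upper: float) -> str:
    """Apply the three-way rule; ties at a cutoff go to clerical review."""
    if weight > upper:
        return "link"
    if weight < lower:
        return "nonlink"
    return "possible link"


# Hypothetical probabilities for surname, first name and age.
m_probs = [0.9, 0.8, 0.7]
u_probs = [0.1, 0.2, 0.3]
w_all_agree = match_weight([True, True, True], m_probs, u_probs)
```

Because the logarithm is monotonically increasing, applying cutoffs to the weight is equivalent to applying the corresponding cutoffs to R itself.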

With the availability of inexpensive computing power, 
there has been an outpouring of new work on record 
linkage techniques (e.g., Jaro 1989, Newcombe et al. 1992, 
Winkler 1994, 1995). The new computer-intensive methods 
reduce, or even sometimes eliminate, the need for clerical 
review when name, address, and other information used in 
matching is of reasonable quality. The proceedings from a 
recently concluded international conference on record 
linkage showcase these ideas and might be the best single 
reference (Alvey and Jamerson 1997). 


3. SIMULATION SETTING 


3.1 Matching Scenarios 


For our simulations, we considered a scenario in which 
matches are virtually indistinguishable from nonmatches. 
In our earlier work (Scheuren and Winkler 1993), we 
considered three matching scenarios in which matches are 
more easily distinguished from nonmatches than in the 
scenario of the present paper. 

In both papers, the basic idea is to generate data having 
known distributional properties, adjoin the data to two files 
that would be matched, and then to evaluate the effect of 
increasing amounts of matching error on analyses. Because 
the methods of this paper work better than what we did 
earlier, we only consider a matching scenario that we label 
“Second Poor,” because it is more difficult than the poor 
(most difficult) scenario we considered previously. 

We started here with two population files (sizes 12,000 
and 15,000), each having good matching information and 
for which true match status was known. Three settings were 
examined: high, medium and low — depending on the extent 
to which the smaller file had cases also included in the 
larger file. In the high file inclusion situation, about 10,000 
cases are on both files for a file inclusion or intersection 
rate on the smaller or base file of about 83%. In the 
medium file intersection situation, we took a sample of one 
file so that the intersection of the two files being matched 
was approximately 25%. In the low file intersection 
situation, we took samples of both files so that the 
intersection of the files being matched was approximately 
5%. The number of intersecting cases, obviously, bounds 
the number of true matches that can be found. 

We then generated quantitative data with known 
distributional properties and adjoined the data to the files. 
These variations are described below and displayed in 
Figure 1 where we show the poor scenario (labeled “first 
poor”) of our previous 1993 paper and the “second poor” 
scenario used in this paper. In the figure, the match weight, 
the logarithm of R, is plotted on the horizontal axis with the 
frequency, also expressed in logs, plotted on the vertical 
axis. Matches (or true links) appear as asterisks (*), while 
nonmatches (or true nonlinks) appear as small circles (o). 


3.2 “First Poor” Scenario (Figure 1a) 


The first poor matching scenario consisted of using last 
name, first name, one address variation, and age. Minor 
typographical errors were introduced independently into 
one fifth of the last names and one third of the first names 
in one of the files. Moderately severe typographical errors 
were made independently in one fourth of the addresses of 
the same file. Matching probabilities were chosen that 
deviated substantially from optimal. The intent was for the 
links to be made in a manner that a practitioner might 
choose after gaining only a little experience. The situation 
is analogous to that of using administrative lists of 
individuals where information used in matching is of poor 
quality. The true mismatch rate here was 10.1%. 


3.3 “Second Poor” Scenario (Figure 1b) 


The second poor matching scenario consisted of using 
last name, first name, and one address variation. Minor 
typographical errors were introduced independently into 
one third of the last names and one third of the first names 
in one of the files. Severe typographical errors were made 
in one fourth of the addresses in the same file. Matching 
probabilities were chosen that deviated substantially from 
optimal. The intent was to represent situations that often 
occur with lists of businesses in which the linker has little 
control over the quality of the lists. Name information — a 
key identifying characteristic — is often very difficult to 
compare effectively with business lists. The true mismatch 
rate was 14.6%. 


3.4 Summary of Matching Scenarios 


Clearly, depending on the scenario, our ability to 
distinguish between true links and true nonlinks differs 
significantly. With the first poor scenario, the overlap, 
shown visually between the log-frequency-versus-weight 
curves, is substantial (Figure 1a); and, with the second poor 
scheme, the overlap of the log-frequency-versus-weight 
curves is almost total (Figure 1b). In the earlier work, we 
showed that our theoretical adjustment procedure worked 
well using the known true match rates in our data sets. For 
situations where the curves of true links and true nonlinks 
were reasonably well separated, we accurately estimated 
error rates via a procedure of Belin and Rubin (1995) and 
our procedure could be used in practice. In the poor 
matching scenario of that paper (first poor scenario of this 
paper), the Belin-Rubin procedure was unable to provide 
accurate estimates of error rates but our theoretical 
adjustment procedure still worked well. This indicated that 
we either had to find an enhancement to the Belin-Rubin 
procedures or to develop methods that used more of the 
available data. (That conclusion, incidentally, from our 
earlier work led, after some false starts, to the present 
approach.) 


3.5 Quantitative Scenarios 


Having specified the above linkage situations, we used 
SAS to generate ordinary least squares data under the 
model $Y = 6X + \epsilon$. The X values were chosen to be 
uniformly distributed between 1 and 101. The error terms 
$\epsilon$ are normal and homoscedastic with variances 13,000, 
36,000, and 125,000, respectively. The resulting regressions 
of Y on X have $R^2$ values in the true matched population of 
70%, 47%, and 20%, respectively. Matching with 
quantitative data is difficult because, for each record in one 
file, there are hundreds of records having quantitative 
values that are close to the record that is a true match. To 
make modeling and analysis even more difficult in the high 
file overlap scenario, we used all false matches and only 5% 
of the true matches; in the medium file overlap scenario, we 
used all false matches and only 25% of true matches. (Note: 
Here, to heighten the visual effect, we have introduced 
another random sampling step, so the reader can “see” 
better in the figures the effect of bad matching. This sample 
depends on the match status of the case and is confined only 
to those cases that were matched, whether correctly or 
falsely.) 

[Figure 1a. First Poor Matching Scenario] 
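
As a rough check on the $R^2$ values quoted for the three error variances, the population $R^2$ implied by the stated model can be computed directly. The sketch below assumes a slope of 6, X uniform on (1, 101) and error terms independent of X; the function names are hypothetical. It gives approximately 70%, 45% and 19%, close to the 70%, 47% and 20% reported for the generated population, where some sampling variation is to be expected.

```python
def uniform_variance(low: float, high: float) -> float:
    """Variance of a continuous uniform distribution on (low, high)."""
    return (high - low) ** 2 / 12.0


def theoretical_r2(beta: float, var_x: float, var_error: float) -> float:
    """Population R^2 for Y = beta * X + error, with error independent of X."""
    explained = beta ** 2 * var_x
    return explained / (explained + var_error)


var_x = uniform_variance(1.0, 101.0)
r2_by_variance = {
    var_error: theoretical_r2(6.0, var_x, var_error)
    for var_error in (13_000.0, 36_000.0, 125_000.0)
}
```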

A crucial practical assumption for the work of this paper 
is that analysts are able to produce a reasonable model 
(guesstimate) for the relationships between the noncommon 
quantitative items. For the initial modeling in the empirical 
example of this paper, we use the subset of pairs for which 
matching weight is high and the error-rate is low. Thus, the 
number of false matches in the subset is kept to a minimum. 
Although neither the procedure of Belin and Rubin (1995) 
nor an alternative procedure of Winkler (1994), that 
requires an ad hoc intervention, could be used to estimate 
error rates, we believe it is possible for an experienced 
matcher to pick out a low-error-rate set of pairs even in the 
second poor scenario. 


4. SIMULATION RESULTS 


Most of this Section is devoted to presenting graphs and 
results of the overall process for the second poor scenario, 
where the $R^2$ value is moderate, and the intersection 
between the two files is high. These results best illustrate 
the procedures of this paper. At the end of the Section (in 
subsection 4.8), we summarize results over all $R^2$ situations 
and all overlaps. To make the modeling more difficult 
and show the power of the analytic linking methods, we 
use all false matches and a random sample of only 5% of 
the true matches. We only consider pairs having matching 
weight above a lower bound that we determine based on 
analytic considerations and experience. For the pairs of our 
analysis, the restriction causes the number of false matches 
to significantly exceed the number of true matches. (Again, 
this is done to heighten the visual effect of matching 
failures and to make the problem even more difficult.) 

To illustrate the data situation and the modeling 
approach, we provide triples of plots. The first plot in the 
triple shows the true data situation as if each record in one 
file was linked with its true corresponding record in the 
other file. The quantitative data pairs correspond to the 
truth. In the second plot, we show the observed data, 
where many of the pairs are in error because they 
correspond to false matches. To get to the third plot in the 
triple, we model using a small number of pairs (approxi- 
mately 100) and then replace outliers with pairs in which 
the observed Y-value is replaced with a predicted Y-value. 


4.1 Initial True Regression Relationship 


In Figure 2a, the actual true regression relationship and 
related scatterplot are shown, for one of our simulations, as 
they would appear if there were no matching errors. In this 
figure and the remaining ones, the true regression line is 
always given for reference. Finally, the true population 
slope or beta coefficient (at 5.85) and the $R^2$ value (at 43%) 
are provided for the data (sample of pairs) being displayed. 


4.2 Regression After Initial RL→RA Step 


In Figure 2b, we are looking at the regression on the 
actual observed links — not what should have happened in 
a perfect world but what did happen in a very imperfect 
one. Unsurprisingly, we see only a weak regression rela- 
tionship between Y and X. The observed slope or beta 
coefficient differs greatly from its true value (2.47 v. 5.85). 
The fit measure is similarly affected — falling to 7% from 
43%. 
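
A simplified argument suggests why false matches pull the slope toward zero. The derivation below is an illustrative sketch under assumptions that are not stated in the paper: each observed link is correct with probability $p$, and the Y-value attached by a false link (written $Y'$ here) is independent of X. The adjustment actually used (Scheuren and Winkler 1993) works with estimated match probabilities and is more refined than this.

```latex
\begin{aligned}
Z &=
\begin{cases}
Y = \beta X + \epsilon, & \text{with probability } p \text{ (true link)},\\
Y', & \text{with probability } 1-p \text{ (false link, } Y' \text{ independent of } X\text{)},
\end{cases}\\[4pt]
E[Z \mid X] &= p\,\beta X + (1-p)\,E[Y'],\\[4pt]
\text{slope of } Z \text{ on } X &= p\,\beta .
\end{aligned}
```

Under these assumptions the regression on observed links estimates $p\beta$ rather than $\beta$. Read this way, an observed slope of 2.47 against a true slope of 5.85 would correspond to $p \approx 2.47/5.85 \approx 0.42$, which is only a back-of-the-envelope figure and not one reported by the authors.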


4.3 Regression After First Combined 
RL→RA→EI→RA Step 


Figure 2c completes our display of the first cycle of the 
iterative process we are employing. Here we have edited 
the data in the plot displayed as follows. First, using just 
the 99 cases with a match weight of 3.00 or larger, an 
attempt was made to improve the poor results given in 
Figure 2b. Using this provisional fit, predicted values were 
obtained for all the matched cases; then outliers with 
residuals of 460 or more were removed and the regression 
refit on the remaining pairs. This new equation, used in 
Figure 2c, was essentially $Y = 4.78X + \epsilon$, with a variance of 
40,000. Using our earlier approach (Scheuren and Winkler 
1993), a further adjustment was made in the estimated beta 
coefficient from 4.78 to 5.4. If a pair of matched records 
yielded an outlier, then predicted values (not shown) using 
the equation $Y = 5.4X$ were imputed. If a pair does not 
yield an outlier, then the observed value was used as the 
predicted value. 
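
The edit and imputation step just described can be sketched as follows. This is a simplified illustration rather than the authors' code: it fits a regression through the origin (matching the form of the equations quoted above), defines outliers by residuals from the provisional fit, and takes the matching-error-adjusted slope as an optional input instead of computing it. The names `slope_through_origin`, `edit_outliers`, `seed_pairs` and `adjusted_slope` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[float, float]  # (x, y) for one linked pair of records


def slope_through_origin(pairs: Sequence[Pair]) -> float:
    """Least-squares slope for the no-intercept model Y = b * X + error."""
    return sum(x * y for x, y in pairs) / sum(x * x for x, _ in pairs)


def edit_outliers(
    pairs: Sequence[Pair],
    seed_pairs: Sequence[Pair],
    cutoff: float = 460.0,
    adjusted_slope: Optional[float] = None,
) -> Tuple[float, List[Pair]]:
    """Fit on high-weight seed pairs, drop outliers, refit, then impute."""
    provisional = slope_through_origin(seed_pairs)

    def is_outlier(x: float, y: float) -> bool:
        # A residual of `cutoff` or more from the provisional fit is an outlier.
        return abs(y - provisional * x) >= cutoff

    kept = [(x, y) for x, y in pairs if not is_outlier(x, y)]
    refit = slope_through_origin(kept)
    # Use the externally adjusted slope for imputation when one is supplied.
    b = refit if adjusted_slope is None else adjusted_slope
    edited = [(x, b * x) if is_outlier(x, y) else (x, y) for x, y in pairs]
    return refit, edited
```

With the paper's numbers, the refit slope would play the role of 4.78 and `adjusted_slope` the role of 5.4; non-outlying pairs keep their observed Y-values.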


4.4 Second True Reference Regression 


Figure 3a displays a scatterplot of X and Y as they would 
appear if they could be true matches based on a second RL 
step. Note here that we have a somewhat different set of 
linked pairs this time from earlier, because we have used 
the regression results to help in the linkage. In particular, 
the second RL step employed the predicted Y values as 
determined above; hence it had more information on which 
to base a linkage. This meant that a different group of 
linked records was available after the second RL step. 
Since a considerably better link was obtained, there were 
fewer false matches; hence our sample of all false matches 
and 5% of the true matches dropped from 1,104 in Figures 
2a through 2c to 650 for Figures 3a through 3c. In this 
second iteration, though, the true slope or beta coefficient 
and the $R^2$ value remained virtually identical to those of the 
first iteration, for both slope (5.85 v. 5.91) and fit (43% v. 48%). 


4.5 Regression After Second RL→RA Step 


In Figure 3b, we see a considerable improvement in the 
relationship between Y and X using the actual observed 
links after the second RL step. The estimated slope has 
risen from 2.47 initially to 4.75 here, still too small but 
much improved. The fit has been similarly affected, rising 
from 7% to 33%. 


4.6 Regression After Second Combined 
RL→RA→EI→RA Step 


Figure 3c completes the display of the second cycle of 
our iterative process. Here we have edited the data as 
follows. Using the fit (from subsection 4.5), another set of 
predicted values was obtained for all the matched cases (as 
in subsection 4.3). This new equation was essentially 
$Y = 5.26X + \epsilon$, with a variance of about 35,000. If a pair of 
matched records yields an outlier, then predicted values 
using the equation $Y = 5.3X$ were imputed. If a pair does 
not yield an outlier, then the observed value was used as the 
predicted value. 


4.7 Additional Iterations 


While we did not show it in this paper, we did iterate 
through a third matching pass. The beta coefficient, after 
adjustment, did not change much. We do not conclude from 
this that asymptotic unbiasedness exists; rather that the 
method, as it has evolved so far, has a positive benefit and 
that this benefit may be quickly reached. 


4.8 Further Results 


Our further results are of two kinds. We looked first at 
what happened in the medium $R^2$ scenario (i.e., $R^2$ equal 
to .47) for the medium and low file-intersection situations. 
We further looked at the cases when $R^2$ was higher (at .70) 
or lower (at .20). For the medium $R^2$ scenario and low 
intersection case the matching was somewhat easier. This 
occurs because there were significantly fewer false-match 
candidates and we could more easily separate true matches 
from false matches. For the high $R^2$ scenarios, the 
modeling and matching were also more straightforward 
than they were for the medium $R^2$ scenario. Hence, there 
were no new issues there either. 

On the other hand, for the low $R^2$ scenario, no matter 
what degree of file intersection existed, we were unable to 
distinguish true matches from false matches, even with the 
improved methods we are using. The reason for this, we 
believe, is that there are many outliers associated with the 
true matches. We can no longer assume, therefore, that a 
moderately higher percentage of the outliers in the 
regression model are due to false matches. In fact, with each 
true match that is associated with an outlier Y-value, there 
may be many false matches that have Y-values that are 
closer to the predicted Y-value than the true match. 


5. COMMENTS AND FUTURE STUDY 


5.1 Overall Summary 


In this paper, we have looked at a very restricted analysis 
setting: a simple regression of one quantitative dependent 
variable from one file matched to a single quantitative 
independent variable from another file. This standard 
analysis was, however, approached in a very nonstandard 
setting. The matching scenarios, in fact, were quite 
challenging. Indeed, just a few years ago, we might have 
said that the “second poor” matching scenario appeared 
hopeless. 

On the other hand, as discussed below, there are many 
loose ends. Hence, the demonstration given here can be 
considered, quite rightly in our view, as a limited 
accomplishment. But make no mistake about it, we are 
doing something entirely new. In past record linkage 
applications, there was a clear separation between the 
identifying data and the analysis data. Here, we have used 
a regression analysis to improve the linkage and the 
improved linkage to improve the analysis and so on. 

Earlier, in our 1993 paper, we advocated that there be a 
unified approach between the linkage and the analysis. At 
that point, though, we were only ready to propose that the 
linkage probabilities be used in the analysis to correct for 
the failures to complete the matching step satisfactorily. 
This paper is the first to propose a completely unified 
methodology and to demonstrate how it might be carried 
out. 


5.2 Planned Application 


We expect that the first applications of our new methods 
will be with large business data bases. In such situations, 
noncommon quantitative data are often moderately or 
highly correlated and the quantitative variables (both 
predicted and observed) can have great distinguishing 
power for linkage, especially when combined with name 
information and geographic information, such as a postal 
(e.g., ZIP) code. 

A second observation is also worth making about our 
results. The work done here points strongly to the need to 
improve some of the now routine practices for protecting 
public use files from reidentification. In fact, it turns out 
that in some settings — even after quantitative data have 
been confidentiality protected (by conventional methods) 
and without any directly identifying variables present — the 
methods in this paper can be successful in reidentifying a 
substantial fraction of records thought to be reasonably 
secure from this risk (as predicted in Scheuren 1995). For 
examples, see Winkler (1997). 


5.3 Expected Extensions 


What happens when our results are generalized to the 
multiple regression case? We are working on this now and 
results are starting to emerge which have given us insight 
into where further research is required. We speculate that 
the degree of underlying association $R^2$ will continue to be 
the dominant element in whether a usable analysis is 
possible. 

There is also the case of multivariate regression. This 
problem is harder and will be more of a challenge. Simple 
multivariate extensions of the univariate comparison of Y 
values in this paper have not worked as well as we would 
like. For this setting, perhaps, variants and extensions of 
Little and Rubin (1987, Chapters 6 and 8) will prove to be 
a good starting point. 


5.4 “Limited Accomplishment” 


Until now an analysis based on the second poor scenario 
would not have been even remotely sensible. For this reason 
alone we should be happy with our results. A closer 
examination, though, shows a number of places where the 
approach demonstrated is weaker than it needs to be or 
simply unfinished. For those who want theorems proven, 
this may be a particularly strong sentiment. For example, a 
convergence proof is among the important loose ends to be 
dealt with, even in the simple regression setting. A practical 
demonstration of our approach with more than two matched 
files also is necessary, albeit this appears to be more 
straightforward. 


5.5 Guiding Practice 


We have no ready advice for those who may attempt 
what we have done. Our own experience, at this point, is 
insufficient for us to offer ideas on how to guide practice, 
except the usual extra caution that goes with any new 
application. Maybe, after our own efforts and those of 
others have matured, we can offer more. 